SYNCHRONIZATION IN A MULTI-TILE PROCESSING ARRANGEMENT
Patent abstract:
A processing system comprising multiple tiles and an interconnect between the tiles. The interconnect is used to communicate within a group of some or all of the tiles according to a bulk synchronous parallel (BSP) scheme, whereby each tile of the group performs an on-tile compute phase followed by an inter-tile exchange phase, the exchange phase being held back until all the tiles of the group have completed the compute phase. Each tile in the group has a local exit state upon completion of the compute phase. The instruction set includes a synchronization instruction to be executed by each tile at the completion of its compute phase to signal a synchronization request to logic in the interconnect. In response to receiving the synchronization request from all the tiles in the group, the logic releases the next exchange phase and also makes available an aggregated state of all the tiles in the group.

Publication number: FR3072800A1
Application number: FR1859637
Filing date: 2018-10-18
Publication date: 2019-04-26
Inventors: Simon Christian Knowles; Alan Graham Alexander
Applicant: Graphcore Ltd
IPC main class:
Patent description:
Technical Field

[0001] The present description relates to the synchronization of the workloads of multiple different tiles in a multi-tile processing arrangement, each tile comprising its own processing unit and its own memory. In particular, the description relates to bulk synchronous parallel (BSP) communication schemes in which each tile of a group of tiles must complete a compute phase before any of the tiles of the group can proceed to an exchange phase.

BACKGROUND ART

A multi-threaded processor is a processor capable of executing multiple program threads side by side. The processor may comprise some hardware that is common to the multiple different threads (e.g. a common instruction memory, data memory and/or execution pipeline); but to support multi-threaded operation, the processor also comprises some dedicated hardware specific to each thread. The dedicated hardware comprises at least a bank of respective context registers for each of the number of threads that can be executed at once. A "context", when talking about multi-threaded processors, refers to the program state of a respective one of the threads being executed side by side (e.g. program counter value, status and current operand values). The "context register bank" refers to the respective set of registers for representing this program state of the respective thread. Registers in a register bank are distinct from general-purpose memory in that register addresses are fixed as bits in instruction words, whereas memory addresses can be computed by executing instructions. (B17781 FR-408529FR) The registers of a given context typically comprise a respective program counter for the respective thread, and a respective set of operand registers for temporarily holding the data acted upon and output by the respective thread during the computations performed by that thread.
Each context may also have a respective status register for storing a status of the respective thread (e.g. whether it is paused or running). Thus each of the currently running threads has its own separate program counter, and optionally operand registers and one or more status registers.

One possible form of multi-threaded operation is parallelism. That is, as well as multiple contexts, multiple execution pipelines are provided: i.e. a separate execution pipeline for each stream of instructions to be executed in parallel. However, this requires a great deal of duplication in terms of hardware. Instead therefore, another form of multi-threaded processor employs concurrency rather than parallelism, whereby the threads share a common execution pipeline (or at least a common part of a pipeline) and different threads are interleaved through this same shared execution pipeline. Performance of a multi-threaded processor may still be improved compared to no concurrency or parallelism, thanks to increased opportunities for hiding pipeline latency. Also, this approach does not require as much extra hardware dedicated to each thread as a fully parallel processor with multiple execution pipelines, and so does not incur as much extra silicon.

One form of parallelism can be achieved by means of a processor comprising an arrangement of multiple tiles on the same chip (i.e. same die), each tile respectively comprising its own separate processing unit and memory (including program memory and data memory). Thus separate portions of program code can be run in parallel on different tiles. The tiles are connected together via an on-chip interconnect which enables the code run on the different tiles to communicate between tiles.
In some cases, the processing unit on each tile may itself run multiple concurrent threads on the tile, each tile having its own respective set of contexts and corresponding pipeline as described above in order to support interleaving of multiple threads on the same tile through the same pipeline.

In general, there may be dependencies between the portions of a program running on different tiles. A technique is therefore required to prevent a piece of code on one tile running ahead of data upon which it is dependent being made available by another piece of code on another tile. There are a number of possible schemes for achieving this, but the scheme of interest here is known as "bulk synchronous parallel" (BSP). According to BSP, each tile performs a compute phase and an exchange phase in an alternating cycle. During the compute phase, each tile performs one or more computation tasks locally on the tile, but does not communicate any results of its computations to any of the other tiles. In the exchange phase, each tile is allowed to exchange one or more results of the computations from the preceding compute phase with one or more others of the tiles in the group, but does not yet proceed to the next compute phase. Further, according to the BSP principle, a barrier synchronization is placed at the juncture transitioning from the compute phase into the exchange phase, or at the transition from the exchange phase into the compute phase, or both. That is to say, either: (a) all tiles are required to complete their respective compute phases before any in the group is allowed to proceed to the next exchange phase, or (b) all tiles in the group are required to complete their respective exchange phases before any tile in the group is allowed to proceed to the next compute phase, or (c) both.
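By way of illustration only, the alternating compute/exchange cycle with a barrier at each transition can be sketched in software. This is a minimal model, not the hardware mechanism of the disclosure; the tile count, phase functions and trace bookkeeping are illustrative assumptions.

```python
import threading

# Minimal software sketch of the BSP alternation described above: no tile
# may begin the exchange phase of a superstep until every tile has finished
# its compute phase, and vice versa for the next compute phase.
NUM_TILES = 4
barrier = threading.Barrier(NUM_TILES)   # models the barrier synchronization
trace = []                               # records (phase, superstep, tile)
lock = threading.Lock()

def tile(tile_id, supersteps=2):
    for step in range(supersteps):
        with lock:
            trace.append(("compute", step, tile_id))   # local computation
        barrier.wait()   # (a) all tiles must finish computing first
        with lock:
            trace.append(("exchange", step, tile_id))  # inter-tile exchange
        barrier.wait()   # (b) all tiles must finish exchanging first

threads = [threading.Thread(target=tile, args=(i,)) for i in range(NUM_TILES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the run, every "compute" entry of a given superstep precedes every "exchange" entry of that superstep, which is exactly the ordering property the barrier enforces.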
In some scenarios, a tile performing computation may be allowed to communicate with other system resources such as a network card or storage disk, as long as no communication with other tiles in the group is involved.

An example use of multi-threaded and/or multi-tile processing is found in machine intelligence. As will be familiar to those skilled in the art of machine intelligence, a machine intelligence algorithm is based on performing iterative updates to a knowledge model, which can be represented by a graph of multiple interconnected nodes. Each node represents a function of its inputs. Some nodes receive the inputs to the graph and some receive inputs from one or more other nodes, whilst the output of some nodes forms the inputs of other nodes, and the output of some nodes provides the output of the graph (and in some cases a given node may even have all of these: inputs to the graph, outputs from the graph and connections to other nodes). Further, the function at each node is parameterized by one or more respective parameters, i.e. weights. During a learning stage the aim is, based on a set of experiential input data, to find values for the various parameters such that the graph as a whole will generate a desired output for a range of possible inputs. Various algorithms for doing this are known in the art, such as a back-propagation algorithm based on stochastic gradient descent. Over multiple iterations based on the input data, the parameters are gradually tuned to decrease their errors, and thus the graph converges toward a solution. In a subsequent stage, the learned model can then be used to make predictions of outputs given a specified set of inputs, or to make inferences as to inputs (causes) given a specified set of outputs.
The implementation of each node will involve the processing of data, and the interconnections of the graph correspond to data to be exchanged between the nodes. Typically, at least some of the processing of each node can be carried out independently of some or all other nodes in the graph, hence large graphs expose great opportunities for concurrency and/or parallelism.

SUMMARY OF THE INVENTION

The following describes components of a processor having an architecture which has been developed to address issues arising in the computations involved in machine intelligence applications. The processor described here may be used as a work accelerator; that is, it receives a workload from an application running on a host computer, the workload generally being in the form of very large data sets to be processed (such as the large experience data sets used by a machine intelligence algorithm to learn a knowledge model, or the data from which to perform a prediction or inference using a previously-learned model). An aim of the architecture presented here is to process these very large amounts of data highly efficiently. The processor architecture has been developed for processing workloads involved in machine intelligence. Nonetheless, it will be apparent that the disclosed architecture may also be suitable for other workloads sharing similar characteristics.

When executing different portions of a program across multiple tiles, it may be necessary to perform a barrier synchronization to bring multiple tiles to a common point of execution. It may also be desired to determine a state of the program as a whole once all the tiles have completed a compute phase, e.g. to determine whether an exception should be reported to the host, or to make a branch decision as to whether to branch to a next part of the program or continue iterating the current part.
For example, if each tile in a group of tiles performs the computations of a respective subgraph of a machine intelligence graph, it may be desired to determine whether the nodes of the subgraphs have all satisfied a certain condition indicating that the graph is converging toward a solution. Making such a determination using existing techniques requires a number of programmed steps using general-purpose instructions.

It is recognized herein that it would be desirable to tailor the instruction set of a processor to applications with large-scale multi-threaded capabilities such as machine learning. According to the present disclosure, this is achieved by providing a dedicated machine code instruction for committing a result of a group of tiles only once all the tiles in the group have completed the current BSP compute phase, thus providing the ability to synchronize the tiles and at the same time determine a global outcome of multiple threads, with reduced latency and lower code overhead.
According to one aspect disclosed herein, there is provided a processing system comprising an arrangement of tiles and an interconnect for communicating between the tiles, wherein: each tile comprises an execution unit for executing machine code instructions, each being an instance of a predefined set of instruction types in an instruction set of the processor, each instruction type in the instruction set being defined by a corresponding opcode and zero or more operand fields for taking zero or more operands; the interconnect is operable to conduct communications among a group of some or all of the tiles according to a bulk synchronous parallel scheme, whereby each of the tiles in said group performs an on-tile compute phase followed by an inter-tile exchange phase, the exchange phase being held back until all the tiles in the group have completed the compute phase, each tile in the group having a local exit state upon completion of the compute phase; the instruction set comprises a synchronization instruction for execution by each tile in the group upon completion of its compute phase, execution of the synchronization instruction causing the execution unit to send a synchronization request to hardware logic in the interconnect; and the logic in the interconnect is configured to aggregate the local exit states into a global exit state, and, in response to completion of the compute phase by all the tiles in the group as indicated by receipt of the synchronization request from all the tiles in the group, to store the global exit state in a global exit state register on each of the tiles in the group, thereby making the global exit state accessible by a portion of code on each of the tiles.

According to another aspect disclosed herein, there is provided a processing system comprising an arrangement of tiles and an interconnect for communicating between the tiles, wherein: each tile
comprises a respective execution unit for executing machine code instructions, each being an instance of a predefined set of instruction types in an instruction set of the processor, each instruction type in the instruction set being defined by a corresponding opcode and zero or more operand fields for taking zero or more operands; the interconnect comprises synchronization logic in the form of dedicated hardware logic for coordinating among a group of some or all of the tiles; the instruction set comprises a synchronization instruction, the execution unit on each respective tile being configured so that, if an instance of the synchronization instruction is executed by the respective execution unit, then in response to the opcode of the synchronization instruction an instance of a synchronization request is transmitted from the respective tile to the synchronization logic in the interconnect, and instruction issue on the respective tile is suspended pending a synchronization acknowledgment being received back from the synchronization logic; each tile comprises a local exit state register for storing a local exit state of the tile upon completion of a respective compute phase; the synchronization logic is configured to aggregate the local exit states of the tiles in the group into a global exit state; and the synchronization logic is further configured, in response to receiving an instance of the synchronization request from all of the tiles in the group, to return the synchronization acknowledgment to each of the tiles in the group, thereby allowing instruction issue to resume, and to store the global exit state in a global exit state register on each of the tiles in the group, thereby making the global exit state accessible by a portion of code running on each of the tiles in the group.
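The behaviour claimed above can be pictured with a small software model: the synchronization logic holds back acknowledgments until every tile in the group has requested sync, AND-aggregates the local exit states, and writes the global exit state back to each tile. All class and method names here are hypothetical illustration; the real logic is dedicated hardware in the interconnect, not software.

```python
# Illustrative model of the interconnect's synchronization logic.
class SyncLogic:
    def __init__(self, group_size):
        self.group_size = group_size
        self.pending = []   # (tile, local_exit_state) pairs received so far

    def sync_request(self, tile, local_exit_state):
        """Called when a tile executes the synchronization instruction."""
        self.pending.append((tile, local_exit_state))
        if len(self.pending) == self.group_size:
            # All tiles have completed the compute phase: aggregate the
            # local exit states (Boolean AND in this sketch).
            global_state = all(state for _, state in self.pending)
            for t, _ in self.pending:
                t.global_exit_state = global_state  # write each tile's register
                t.sync_ack()                        # release instruction issue
            self.pending = []

class Tile:
    def __init__(self):
        self.global_exit_state = None   # models the global exit state register
        self.acked = False              # True once instruction issue may resume

    def sync_ack(self):
        self.acked = True
```

Note the key property: the first tiles to request sync are held (no acknowledgment) until the last tile in the group arrives, at which point all are released together with the same aggregated state.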
In embodiments, the execution unit on each tile may be configured to pause instruction issue in response to execution of the synchronization instruction; and the logic in the interconnect may be configured, in response to receiving the synchronization request from all of the tiles in the group, to return a synchronization acknowledgment signal to each of the tiles in the group in order to resume instruction issue.

In embodiments, each of the local exit states and the global exit state may be a single bit. In embodiments, the aggregation may consist of a Boolean AND of the local exit states, or a Boolean OR of the local exit states. In alternative embodiments, the aggregated exit state may comprise at least two bits representing a ternary value, indicating whether the local exit states were all true, all false, or mixed. In embodiments, each of the tiles in the group of tiles may comprise a local exit state register arranged to hold the local exit state of the tile.
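The two-bit ternary alternative mentioned above can be sketched as follows. The particular bit encoding chosen here is an assumption for illustration; the disclosure only requires that the aggregate distinguish all-true, all-false and mixed.

```python
# Sketch of the ternary aggregation: two bits distinguishing whether the
# local exit states were all true, all false, or mixed. The encoding
# values are illustrative assumptions, not taken from the disclosure.
ALL_FALSE, ALL_TRUE, MIXED = 0b00, 0b01, 0b10

def aggregate_ternary(local_states):
    if all(local_states):
        return ALL_TRUE
    if not any(local_states):
        return ALL_FALSE
    return MIXED
```

Unlike the single-bit AND or OR, the ternary form lets the receiving code tell "every tile failed" apart from "some tiles failed", at the cost of one extra bit.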
In embodiments, each tile in the group may comprise: multiple sets of context registers, each set of context registers being arranged to store a program state of a respective one of multiple threads; and a scheduler arranged to schedule execution of a respective one of a plurality of worker threads in each of a plurality of time slots in a repeating sequence of interleaved time slots, the program state of each of the worker threads being stored in a respective one of the sets of context registers; wherein according to the bulk synchronous parallel scheme, the exchange phase is held back until all the worker threads on all the tiles in the group have completed the compute phase; wherein the local exit state on each tile is an aggregate of an individual exit state output by each of the worker threads on the tile; and wherein said portion of code may comprise at least one of the multiple worker threads on the tile.

In embodiments, each tile in the group may comprise hardware logic arranged to perform the aggregation of the individual exit states into the local exit state. In embodiments, the instruction set may comprise an exit instruction for inclusion in each of the worker threads, and the execution unit may be arranged to output the individual exit state of the respective worker thread and to terminate the respective worker thread in response to the opcode of the exit instruction. In embodiments, each of the individual exit states and the local exit state may be a single bit, and the aggregation of the individual exit states may consist of a Boolean AND of the individual exit states, or a Boolean OR of the individual exit states. In embodiments, the local exit state may comprise at least two bits representing a ternary value, indicating whether the individual exit states were all true, all false, or mixed.
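The on-tile half of the aggregation described above can likewise be modelled: each worker's exit instruction contributes its individual exit state, and per-tile logic folds the contributions into the tile's local exit state (Boolean AND in this sketch). Class and method names are hypothetical; on the real device this is hardware logic, not a software object.

```python
# Illustrative model of per-tile aggregation of worker exit states.
class TileExitAggregator:
    def __init__(self):
        self.local_exit_state = True   # identity element for Boolean AND
        self.workers_done = 0

    def worker_exit(self, individual_state):
        """Models a worker executing the exit instruction: its individual
        exit state is folded into the tile's local exit state and the
        worker terminates (counted here)."""
        self.local_exit_state = self.local_exit_state and individual_state
        self.workers_done += 1
```

Once every worker on the tile has exited, `local_exit_state` is the value the tile would present to the interconnect's synchronization logic along with its sync request.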
In embodiments, the exchange phase may be arranged to be performed by a supervisor thread separate from the worker threads, and said at least one thread may comprise the supervisor thread. In embodiments, pausing instruction issue may comprise at least pausing issue of instructions from the supervisor thread pending the synchronization acknowledgment.

In embodiments, the sets of context registers on each tile may comprise multiple sets of worker context registers arranged to represent the respective program states of the plurality of worker threads, and an additional set of supervisor context registers arranged to represent a program state of the supervisor thread.

In embodiments: the supervisor thread may be arranged to begin by running in each of the time slots; the instruction set may further comprise a relinquish instruction, and the execution unit may be arranged, in response to the opcode of the relinquish instruction, to relinquish the time slot in which the relinquish instruction is executed to the respective worker thread; and the exit instruction may cause the respective time slot in which the exit instruction is executed to be handed back to the supervisor thread, such that the supervisor thread resumes running in that slot.

In embodiments, said portion of code may be arranged to use the global exit state, once committed, to make a branch decision dependent on the global exit state.
In embodiments, the processing system may be programmed to perform a machine intelligence algorithm in which each node in a graph has one or more respective input edges and one or more respective output edges, the input edges of at least some of the nodes being the output edges of at least some others of the nodes, each node comprising a respective function relating its output edges to its input edges, each respective function being parameterized by one or more respective parameters, and each of the respective parameters having an associated error, such that the graph converges toward a solution as the errors in some or all of the parameters reduce; wherein each of the tiles may model a respective subgraph comprising a subset of the nodes in the graph, and each of the local exit states may be used to indicate whether the errors in said one or more parameters of the nodes in the respective subgraph have satisfied a predetermined condition.

In embodiments, said group may be selected at least in part by an operand of the synchronization instruction. In embodiments, the operand of the synchronization instruction may select whether to include in said group only tiles on the same chip, or tiles on different chips. In embodiments, the operand of the synchronization instruction may select said group from among different hierarchical levels of groupings. In embodiments, the instruction set may also comprise an abstain instruction, which causes the tile on which the abstain instruction is executed to opt out of said group.
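By way of a concrete illustration of the machine-intelligence use case just described: each tile reports a local exit state of true only when every parameter error in its subgraph is below some threshold, and the AND-aggregated global exit state then tells every tile, in one register read, whether the whole graph has converged. The threshold and data layout below are illustrative assumptions.

```python
# Sketch: deriving exit states from parameter errors in a subgraph.
def local_exit_state(param_errors, threshold=1e-3):
    """The tile's local exit state: True only if every parameter error in
    its subgraph satisfies the predetermined condition (here, below a
    magnitude threshold)."""
    return all(abs(e) < threshold for e in param_errors)

def global_exit_state(per_tile_errors, threshold=1e-3):
    """Boolean AND across tiles: True exactly when every subgraph has
    converged, i.e. the whole graph has converged."""
    return all(local_exit_state(errs, threshold) for errs in per_tile_errors)
```

A supervisor thread could then branch on the global exit state: stop iterating if true, run another superstep if false.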
According to another aspect disclosed herein, there is provided a method of operating a processing system comprising an arrangement of tiles and an interconnect for communicating between the tiles, wherein each tile comprises an execution unit for executing machine code instructions, each being an instance of a predefined set of instruction types in an instruction set of the processor, each instruction type in the instruction set being defined by a corresponding opcode and zero or more operand fields for taking zero or more operands; the method comprising: conducting communications among a group of some or all of the tiles, via the interconnect, according to a bulk synchronous parallel scheme, whereby each of the tiles in the group performs an on-tile compute phase followed by an inter-tile exchange phase, the exchange phase being held back until all the tiles in the group have completed the compute phase, each tile in the group having a local exit state upon completion of the compute phase; wherein the instruction set comprises a synchronization instruction for execution by each tile in the group upon completion of its compute phase, execution of the synchronization instruction causing the execution unit to send a synchronization request to hardware logic in the interconnect; and the method comprises, in response to completion of the compute phase by all the tiles in the group as indicated by receipt of the synchronization request from all the tiles in the group, triggering the logic in the interconnect to aggregate the local exit states into a global exit state, and to store the global exit state in a global exit state register on each of the tiles in the group, thereby making the global exit state accessible to a portion of code running on each of the tiles in the group.
According to another aspect disclosed herein, there is provided a computer program product comprising code embodied on computer-readable storage and configured to run on the processing system of any of the embodiments described here, the code comprising a portion to be run on each tile of the group, with an instance of the synchronization instruction included in each portion.

BRIEF DESCRIPTION OF THE DRAWINGS

To aid understanding of the present description and to show how embodiments may be put into effect, reference will be made, by way of example, to the accompanying drawings, in which:

[Fig. 1] Figure 1 is a block diagram of a multi-threaded processing unit;
[Fig. 2] Figure 2 is a block diagram of a plurality of thread contexts;
[Fig. 3] Figure 3 illustrates a scheme of interleaved execution time slots;
[Fig. 4] Figure 4 illustrates a supervisor thread and a plurality of worker threads;
[Fig. 5] Figure 5 is a logic block diagram for aggregating the exit states of multiple threads;
[Fig. 6] Figure 6 schematically illustrates synchronization between worker threads on the same tile;
[Fig. 7] Figure 7 is a block diagram of a processor chip comprising multiple tiles;
[Fig. 8] Figure 8 is a schematic illustration of a bulk synchronous parallel (BSP) computing model;
[Fig. 9] Figure 9 is another schematic illustration of a BSP model;
[Fig. 10] Figure 10 is a schematic illustration of BSP between multi-threaded processing units;
[Fig. 11] Figure 11 is a block diagram of an interconnect system;
[Fig. 12] Figure 12 is a schematic illustration of a system of multiple interconnected processor chips;
[Fig. 13] Figure 13 is a schematic illustration of a multi-tier BSP scheme;
[Fig. 14] Figure 14 is another schematic illustration of a system of multiple processor chips;
[Fig. 15] Figure 15 is a schematic illustration of a graph used in a machine intelligence algorithm; and
[Fig.
16] Figure 16 illustrates an example wiring for synchronization between chips.

Detailed description of embodiments

In the following, a processor architecture is described which includes, in its instruction set, a dedicated instruction for performing a barrier synchronization and at the same time aggregating the exit states of multiple threads on multiple tiles into a single aggregated state in an exit state register, this aggregated exit state register existing on each tile and containing the same aggregated result on each tile. First, however, an example processor in which this may be incorporated is described with reference to Figures 1 to 4.

Figure 1 illustrates an example of a processor module 4 in accordance with embodiments of the present disclosure. For instance, the processor module 4 may be one tile of an array of like processor tiles on the same chip, or may be implemented as a stand-alone processor on its own chip. The processor module 4 comprises a multi-threaded processing unit 10 in the form of a barrel-threaded processing unit, and a local memory 11 (i.e. on the same tile in the case of a multi-tile array, or on the same chip in the case of a single-processor chip). A barrel-threaded processing unit is a type of multi-threaded processing unit in which the execution time of the pipeline is divided into a repeating sequence of interleaved time slots, each of which can be owned by a given thread. This will be described in more detail shortly. The memory 11 comprises an instruction memory 12 and a data memory 22 (which may be implemented in different addressable memory modules or in different regions of the same addressable memory module). The instruction memory 12 stores machine code to be executed by the processing unit 10, whilst the data memory 22 stores both data to be operated on by the executed code and data output by the executed code (e.g. as a result of such operations).
The memory 12 stores a variety of different threads of a program, each thread comprising a respective sequence of instructions for performing a certain task or tasks. Note that an instruction as referred to here means a machine code instruction, i.e. an instance of one of the fundamental instructions of the processor's instruction set, consisting of a single opcode and zero or more operands.

The program described here comprises a plurality of worker threads, and a supervisor subprogram which may be structured as one or more supervisor threads. This will be described in more detail shortly. In embodiments, each of some or all of the worker threads takes the form of a respective "codelet". A codelet is a particular type of thread, sometimes also referred to as an "atomic" thread. It has all the input information it needs to execute from the beginning of the thread (from the time of being launched), i.e. it does not take any input from any other part of the program or from memory after being launched. Further, no other part of the program will use any outputs (results) of the thread until it has terminated (finished). Unless it encounters an error, it is guaranteed to finish. Note that some literature also defines a codelet as being stateless, i.e. if run twice it could not inherit any information from its first run, but that additional definition is not adopted here. Note also that not all of the worker threads need be codelets (atomic), and in some embodiments some or all of the workers may instead be able to communicate with one another.

Within the processing unit 10, multiple different ones of the threads from the instruction memory 12 can be interleaved through a single execution pipeline 13 (though typically only a subset of all the threads stored in the instruction memory can be interleaved at any given point in the overall program).
The multi-threaded processing unit 10 comprises: a plurality of context register banks 26, each arranged to represent the state (context) of a different respective one of the threads to be executed concurrently; a shared execution pipeline 13 that is common to the concurrently executed threads; and a scheduler 24 for scheduling the concurrent threads for execution through the shared pipeline in an interleaved manner, preferably in a round-robin manner. The processing unit 10 is connected to a shared instruction memory 12 common to the plurality of threads, and a shared data memory 22 that is again common to the plurality of threads. The execution pipeline 13 comprises a fetch stage 14, a decode stage 16, and an execute stage 18 comprising an execution unit which may perform arithmetic and logical operations, address calculations, load and store operations, and other operations, as defined by the instruction set architecture. Each of the context register banks 26 comprises a respective set of registers for representing the program state of a respective thread.

[0060] An example of the registers making up each of the context register banks 26 is illustrated in Figure 2. Each of the context register banks 26 comprises one or more respective control registers 28, including at least a program counter (PC) for the respective thread (for keeping track of the instruction address at which the thread is currently executing), and in embodiments also a set of one or more status registers (SR) recording a current status of the respective thread (such as whether it is currently running or paused, e.g. because it has encountered an error).
Each of the context register banks 26 also comprises a respective set of operand registers (OP) 32, for temporarily holding operands of the instructions executed by the respective thread, i.e. values operated upon, or resulting from operations defined by, the opcodes of the respective thread's instructions when executed. Note that each of the context register banks 26 may optionally comprise one or more other types of respective register (not shown). Note also that whilst the term "register bank" is sometimes used to refer to a group of registers in a common address space, this does not necessarily have to be the case in the present disclosure, and each of the hardware contexts 26 (each of the register sets 26 representing each context) may more generally comprise one or more such register banks.

As will be described in more detail below, the arrangement described has one worker context register bank CX0...CX(M-1) for each of the M threads that can be executed concurrently (M = 3 in the example illustrated, but this is not limiting), and one additional supervisor context register bank CXS. The worker context register banks are reserved for storing the contexts of worker threads, and the supervisor context register bank is reserved for storing the context of a supervisor thread. Note that in embodiments the supervisor context is special, in that it comprises a different number of registers than each of the workers. Each of the worker contexts preferably has the same number of status registers and operand registers as one another. In embodiments, the supervisor context may have fewer operand registers than each of the workers.
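The context register arrangement just described can be sketched as a data structure: M identical worker banks CX0...CX(M-1) plus one supervisor bank CXS with fewer operand registers. The register counts below are illustrative assumptions, not the real architecture's figures.

```python
from dataclasses import dataclass, field

# Sketch of one context register bank: a program counter (PC), status
# registers (SR) and operand registers (OP), as described above.
@dataclass
class ContextRegisters:
    pc: int = 0                                                   # PC
    status: dict = field(default_factory=lambda: {"paused": False})  # SR
    operands: list = field(default_factory=lambda: [0] * 16)         # OP

M = 3  # number of concurrently executable worker threads (as in the example)
worker_contexts = [ContextRegisters() for _ in range(M)]   # CX0..CX(M-1)
# Supervisor context CXS: illustratively given fewer operand registers,
# since the supervisor may not need e.g. floating-point or weight registers.
supervisor_context = ContextRegisters(operands=[0] * 8)
```

Each bank is per-thread hardware state; the shared pipeline reads whichever bank belongs to the thread occupying the current time slot.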
Examples of operand registers that the worker contexts may comprise but the supervisor does not include: floating point registers, accumulator registers, and/or dedicated weight registers (for holding weights of a neural network). In embodiments the supervisor may also comprise a different number of status registers. Further, in embodiments the instruction set architecture of the processor module 4 may be arranged such that the worker threads and supervisor thread(s) execute some different types of instruction but also share some instruction types. The fetch stage 14 is connected so as to fetch instructions to be executed from the instruction memory 12, under control of the scheduler 24. The scheduler 24 is arranged to control the fetch stage 14 to fetch an instruction from each of a set of concurrently executing threads in turn, in a repeating sequence of time slots, thus dividing the resources of the pipeline 13 into a plurality of temporally interleaved time slots, as will be described in more detail shortly. For example, the scheduling scheme could be round-robin or weighted round-robin. Another term for a processor operating in such a manner is a barrel-threaded processor. In some embodiments, the scheduler 24 may have access to one of the status registers SR of each thread indicating whether the thread is paused, so that the scheduler 24 in fact controls the fetch stage 14 to fetch the instructions of only those threads that are currently active.
In embodiments, preferably each time slot (and corresponding context register bank) is always owned by one thread or another, i.e. each slot is always occupied by some thread, and each slot is always included in the sequence of the scheduler 24; though the thread occupying any given slot may happen to be paused at the time, in which case, when the sequence comes around to that slot, the instruction fetch for the respective thread is passed over. Alternatively it is not excluded, for example, that in alternative, less preferred implementations, some slots can be temporarily vacant and excluded from the scheduled sequence. Where reference is made to the number of time slots the processing unit is capable of interleaving, or suchlike, this refers to the maximum number of slots the processing unit is capable of executing concurrently, i.e. the number of concurrent slots the processing unit's hardware supports. The fetch stage 14 has access to the program counter (PC) of each of the contexts. For each respective thread, the fetch stage 14 fetches the next instruction of that thread from the next address in the program memory 12 as indicated by the program counter. The program counter increments each execution cycle unless branched by a branch instruction. The fetch stage 14 then passes the fetched instruction to the decode stage 16 to be decoded, and the decode stage 16 then passes an indication of the decoded instruction to the execution unit 18 along with the decoded addresses of any operand registers 32 specified in the instruction, in order for the instruction to be executed.
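The round-robin fetch behaviour just described, in which every slot stays in the scheduler's sequence but a paused occupant's fetch is simply passed over, can be sketched as follows. This is a minimal software analogy of the scheduler 24, not the hardware mechanism; the function name and argument shapes are assumptions.

```python
# Sketch of round-robin instruction fetch with paused slots skipped:
# the slot remains in the repeating sequence; only its fetch is passed over.
def round_robin_fetch(n_slots, active, n_rounds=1):
    """Return the slot indices, in order, for which an instruction is fetched.
    `active[i]` is True when the thread occupying slot i is not paused."""
    fetched = []
    for _ in range(n_rounds):
        for s in range(n_slots):      # every slot is visited every round
            if active[s]:             # fetch only from currently active threads
                fetched.append(s)
    return fetched

# Example: four slots S0..S3, with the thread in slot 2 paused
print(round_robin_fetch(4, [True, True, False, True]))  # [0, 1, 3]
```

A weighted round-robin variant would simply visit some slots more than once per round; the skip-if-paused rule would be unchanged.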
The execution unit 18 has access to the operand registers 32 and the control registers 28, which it may use in executing the instruction based on the decoded register addresses, such as in the case of an arithmetic instruction (e.g. by adding, multiplying, subtracting or dividing the values in two operand registers and outputting the result to another operand register of the respective thread). Or if the instruction defines a memory access (load or store), the load/store logic of the execution unit 18 loads a value from the data memory into an operand register of the respective thread, or stores a value from an operand register of the respective thread into the data memory 22, in accordance with the instruction. Or if the instruction defines a branch or a status change, the execution unit changes the value in the program counter PC or one of the status registers SR accordingly. Note that whilst one thread's instruction is being executed by the execution unit 18, an instruction from the thread in the next time slot in the interleaved sequence can be being decoded by the decode stage 16; and/or whilst one instruction is being decoded by the decode stage 16, the instruction from the thread in the next time slot after that can be being fetched by the fetch stage 14 (though in general the scope of the disclosure is not limited to one instruction per time slot; e.g. in alternative scenarios a batch of two or more instructions could be issued from a given thread per time slot). The interleaving thus advantageously hides latency in the pipeline 13, in accordance with known barrel-threading techniques. An example of the interleaving scheme implemented by the scheduler 24 is illustrated in FIG. 3.
Here the concurrent threads are interleaved according to a round-robin scheme whereby, within each round of the scheme, the round is divided into a sequence of time slots S0, S1, S2 ..., each for executing a respective thread. Typically each slot is one processor cycle long and the different slots are of equal size, though this is not necessarily so in all possible embodiments; e.g. a weighted round-robin scheme is also possible whereby some threads get more cycles than others per execution round. In general the barrel threading may employ either an even round-robin scheme or a weighted round-robin scheme, where in the latter case the weighting may be fixed or adaptive. Whatever the sequence per execution round, this pattern then repeats, each round comprising a respective instance of each of the time slots. Note therefore that a time slot as referred to herein means the repeating allocated place in the sequence, not a particular instance of the slot in a given repetition of the sequence. Put another way, the scheduler 24 apportions the execution cycles of the pipeline 13 into a plurality of temporally interleaved (time-division multiplexed) execution channels, each comprising a recurrence of a respective time slot in a repeating sequence of time slots. In the illustrated embodiment there are four time slots, but this is just for illustrative purposes and other numbers are possible. E.g. in one preferred embodiment there are in fact six time slots. Whatever the number of time slots the round-robin scheme is divided into, then in accordance with the present disclosure, the processing unit 10 comprises one more context register bank 26 than there are time slots, i.e. it supports one more context than the number of interleaved time slots it is capable of barrel-threading. This is illustrated by way of example in the figure: if there are four time slots S0 ...
S3 as shown in FIG. 3, then there are five context register banks, labelled here CX0, CX1, CX2, CX3 and CXS. That is, even though there are only four execution time slots S0 ... S3 in the barrel-threaded scheme and so only four threads can be executed concurrently, it is disclosed herein to add a fifth context register bank CXS, comprising a fifth program counter (PC), a fifth set of operand registers 32, and in embodiments also a fifth set of one or more status registers (SR). Note however that, as mentioned, in embodiments the supervisor context may differ from the others CX0 ... 3, and the supervisor thread may support a different set of instructions for operating the execution pipeline 13. Each of the first four contexts CX0 ... CX3 is used to represent the state of a respective one of a plurality of worker threads currently assigned to one of the four execution time slots S0 ... S3, for performing whatever application-specific computation tasks are desired by the programmer (note again that this may be only a subset of the total number of worker threads of the program as stored in the instruction memory 12). The fifth context CXS however is reserved for a special function, to represent the state of a supervisor thread (SV) whose role it is to coordinate the execution of the worker threads, at least in the sense of assigning which of the worker threads W is to be executed in which of the time slots S0, S1, S2 ... at which point in the overall program. Optionally the supervisor thread may have other supervisory or coordinating responsibilities. For example, the supervisor thread may be responsible for performing barrier synchronizations to ensure a certain order of execution.
For example, in a case where one or more second threads are dependent on data to be output by one or more first threads run on the same processor module 4, the supervisor may perform a barrier synchronization to ensure that none of the second threads begins until the first threads have finished. Alternatively or additionally, the supervisor may perform a barrier synchronization to ensure that one or more threads on the processor module 4 do not begin until a certain external source of data, such as another tile or processor chip, has completed the processing required to make that data available. The supervisor thread may also be used to perform other functionality relating to the multiple worker threads. For example, the supervisor thread may be responsible for communicating data externally to the processor module 4 (to receive external data to be acted on by one or more of the threads, and/or to transmit data output by one or more of the worker threads). In general the supervisor thread may be used to provide any kind of overseeing or coordinating function desired by the programmer. For instance, as another example, the supervisor may oversee transfers between the tile's local memory 12 and one or more resources in the wider system (external to the array 6) such as a storage disk or a network card. It will of course be noted that four slots is just an example, and generally in other embodiments there may be other numbers, such that if there is a maximum of M time slots 0 ... M-1 per round, the processor module 4 comprises M+1 contexts CX ... CX(M-1) & CXS, i.e. one for each worker thread that can be interleaved at any given time and an extra context for the supervisor. E.g. in one exemplary implementation there are six time slots and seven contexts. Referring to FIG. 4, the supervisor thread SV does not have its own time slot per se in the scheme of interleaved time slots.
The same is true for the worker threads, since allocation of slots to worker threads is flexibly defined. Rather, each time slot has its own dedicated context register bank (CX0 ... CXM-1) for storing worker context, which is used by the worker when the slot is allocated to the worker, but not used when the slot is allocated to the supervisor. When a given slot is allocated to the supervisor, that slot instead uses the supervisor's context register bank CXS. Note that the supervisor always has access to its own context and that no worker is able to occupy the supervisor context register bank CXS. The supervisor thread SV has the ability to run in any and all of the time slots S0 ... S3 (or more generally S0 ... SM-1). The scheduler 24 is arranged so as, when the program as a whole starts, to begin by allocating the supervisor thread to all of the time slots, i.e. so the supervisor SV starts out running in all of S0 ... S3. However, the supervisor thread is provided with a mechanism for, at some subsequent point (either straight away or after performing one or more supervisor tasks), temporarily relinquishing each of the slots in which it is running to a respective one of the worker threads, e.g. initially the workers W0 ... W3 in the example shown in FIG. 4. This is achieved by the supervisor thread executing a relinquish instruction, called RUN by way of example herein. In embodiments this instruction takes two operands: an address of a worker thread in the instruction memory 12 and an address of some data for that worker thread in the data memory 22: RUN task_addr, data_addr. The worker threads are portions of code that can be run concurrently with one another, each representing one or more respective computation tasks to be performed. The data address may specify some data to be acted upon by the worker thread.
Alternatively, the relinquish instruction may take only a single operand specifying the address of the worker thread, and the data address could be included in the code of the worker thread; or in another example the single operand could point to a data structure specifying the addresses of the worker thread and the data. As mentioned, in embodiments at least some of the worker threads may take the form of codelets, i.e. atomic units of concurrently executable code. Alternatively or additionally, some of the worker threads need not be codelets and may instead be able to communicate with one another. The relinquish instruction (RUN) acts on the scheduler 24 so as to relinquish the current time slot, in which this instruction is itself executed, to the worker thread specified by the operand. Note that it is implicit in the relinquish instruction that it is the time slot in which this instruction is executed that is being relinquished (implicit in the context of machine code instructions means it does not need an operand to specify this — it is understood implicitly from the opcode itself). Thus the time slot that is given away is the time slot in which the supervisor executes the relinquish instruction. Or put another way, the supervisor is executing in the same space that it gives away. The supervisor says run this piece of code at this location, and then from that point onwards the recurring slot is owned (temporarily) by the relevant worker thread. The supervisor thread SV performs a similar operation in each of one or more others of the time slots, to give away some or all of its time slots to different respective ones of the worker threads W0 ... W3 (selected from a larger set W0 ... Wj in the instruction memory 12). Once it has done so for the last slot, the supervisor is suspended (then later it will resume where it left off when one of the slots is handed back by a worker W).
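The two-operand RUN semantics described above, including the point that the relinquished slot is implicitly the one in which the supervisor executes the instruction, can be sketched in software. This is a behavioural model only, with assumed names; it is not the hardware scheduler 24 itself.

```python
# Sketch of RUN task_addr, data_addr: the supervisor gives away the very
# time slot in which the instruction executes (the slot operand is implicit).
class SlotScheduler:
    def __init__(self, n_slots):
        # at program start the supervisor SV owns every time slot
        self.owner = ["SV"] * n_slots

    def run(self, current_slot, task_addr, data_addr):
        """Model of RUN executed by the supervisor in current_slot."""
        assert self.owner[current_slot] == "SV", "only the supervisor relinquishes"
        # the relinquished slot is implicitly the one the instruction ran in
        self.owner[current_slot] = ("worker", task_addr, data_addr)

sched = SlotScheduler(4)
sched.run(0, task_addr=0x100, data_addr=0x2000)  # supervisor gives slot 0 away
print(sched.owner[0])                            # ('worker', 256, 8192)
```

A RUNALL variant (described below) would simply call `run` for every slot with the same task address.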
The supervisor thread SV is thus able to allocate different worker threads, each performing one or more tasks, to different ones of the interleaved execution time slots S0 ... S3. When the supervisor thread determines it is time to run a worker thread, it uses the relinquish instruction RUN to allocate that worker to the time slot in which the RUN instruction was executed. In some embodiments, the instruction set also comprises a variant of the run instruction, RUNALL (run all). This instruction is used to launch a set of more than one worker together, all executing the same code. In embodiments this launches a worker in every one of the processing unit's slots S0 ... S3 (or more generally S0 ... S(M-1)). Further, in some embodiments the RUN and/or RUNALL instruction, when executed, also automatically copies some status from one or more of the supervisor status registers (SR) of CXS into one or more corresponding status registers of the worker(s) launched by the RUN or RUNALL. For instance the copied status may comprise one or more modes, such as a floating point rounding mode (e.g. round to nearest or round to zero) and/or an overflow mode (e.g. saturate or use a value representing infinity). The copied status or mode then controls the worker in question to operate in accordance with the copied status or mode. In embodiments, the worker can later overwrite this in its own status register (but cannot change the supervisor's status). In further alternative or additional embodiments, the workers may choose to read some status from one or more of the supervisor's status registers (and again may change their own status later). E.g. again this could be to adopt a mode from the supervisor status register, such as a floating point mode or a rounding mode. However, in embodiments the supervisor cannot read any of the CX0 ...
context registers of the worker threads. Once launched, each of the currently allocated worker threads W0 ... W3 proceeds to perform the one or more computation tasks defined in the code specified by the respective relinquish instruction. At the end of this, the respective worker thread then hands the time slot in which it is running back to the supervisor thread. This is achieved by executing an exit instruction (EXIT). The EXIT instruction takes at least one operand and preferably only a single operand, exit_state (e.g. a binary value), to be used for any purpose desired by the programmer to indicate a state of the respective codelet upon ending (e.g. to indicate whether a certain condition was met): EXIT exit_state. The EXIT instruction acts on the scheduler 24 so that the time slot in which it is executed is returned back to the supervisor thread. The supervisor thread can then perform one or more subsequent supervisor tasks (e.g. barrier synchronization and/or exchange of data with external resources such as other tiles), and/or continue to execute another relinquish instruction to allocate a new worker thread (W4, etc.) to the slot in question. Note again therefore that the total number of threads in the instruction memory 12 may be greater than the number that the barrel-threaded processing unit 10 can interleave at any one time. It is the role of the supervisor thread SV to schedule which of the worker threads W0 ... Wj from the instruction memory 12, at which stage in the overall program, are to be assigned to which of the interleaved time slots S0 ... SM in the scheduler 24's round-robin schedule.
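The hand-back performed by EXIT, with its single exit_state operand and the implicit slot identity, can be sketched alongside the RUN model. Again this is only a software analogy with assumed names; the hardware aggregation of the exit state is covered by FIG. 5 below.

```python
# Sketch of EXIT exit_state: the worker occupying current_slot returns the
# slot to the supervisor and supplies one exit_state operand, recorded here
# for later aggregation by dedicated hardware logic.
def exit_instruction(slot_owner, current_slot, exit_state, exit_states):
    """Model of EXIT executed by the worker in current_slot."""
    exit_states.append(exit_state)   # exit state handed to the aggregation logic
    slot_owner[current_slot] = "SV"  # the slot is returned to the supervisor

owners = ["W0", "W1", "SV", "SV"]
states = []
exit_instruction(owners, 0, 1, states)   # worker W0 exits with state 1 (true)
print(owners, states)                    # ['SV', 'W1', 'SV', 'SV'] [1]
```

The supervisor may then execute another RUN in the returned slot to launch a further worker (W4, etc.), as the text describes.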
Furthermore, the EXIT instruction is given an additional special function, namely to cause the exit state specified in the operand of the EXIT instruction to be automatically aggregated (by dedicated hardware logic) with the exit states of a plurality of other worker threads being run through the same pipeline 13 of the same processor module 4 (e.g. the same tile). Thus an extra, implicit facility is included in the instruction for terminating a worker thread. An example circuit for achieving this is shown in FIG. 5. In this example, the exit states of the individual threads and the aggregated exit state each take the form of a single bit, i.e. 0 or 1. The processor module 4 comprises a register 38 for storing the aggregated exit state of that processor module 4. This register may be referred to herein as the local consensus register $LC (as opposed to a global consensus when the processor module 4 is included as one of an array of similar processor tiles, as will be discussed in more detail shortly). In embodiments this local consensus register $LC 38 is one of the supervisor's status registers in the supervisor's context register bank CXS. The logic for performing the aggregation comprises an AND gate 37 arranged to perform a logical AND of (A) the exit state specified in the EXIT instruction's operand and (B) the current value in the local consensus register ($LC) 38, and to output the result (Q) back into the local consensus register $LC 38 as the new value of the local aggregate. At a suitable synchronization point in the program, the value stored in the local consensus register ($LC) 38 is initially reset to a value of 1. I.e. any threads exiting after this point will contribute to the locally aggregated exit state $LC until it is next reset. The output (Q) of the AND gate 37 is 1 if both inputs (A, B) are 1, but otherwise the output Q goes to 0 if either of the inputs (A, B) is 0.
Each time an EXIT instruction is executed, its exit state is aggregated with those that have gone before (since the last reset). Thus by means of the arrangement shown in FIG. 5, the logic keeps a running aggregate of the exit states of any worker threads that have terminated by means of an EXIT instruction since the last time the local consensus register ($LC) 38 was reset. In this example, the running aggregate indicates whether or not all the threads so far have exited true: any exit state of 0 from any of the worker threads will mean the aggregate in the register 38 becomes latched at 0 until the next reset. In embodiments the supervisor SV can read the running aggregate at any time by getting the current value from the local consensus register ($LC) 38 (it does not need to wait for an on-tile synchronization to do so). The resetting of the aggregate in the local consensus register ($LC) 38 may be performed by the supervisor SV performing a PUT to the register address of the local consensus register ($LC) 38 using one or more general purpose instructions, in this example to put a value of 1 into the register 38. Alternatively it is not excluded that the reset could be performed by an automated mechanism, e.g. triggered by executing the SYNC instruction described later herein. The aggregation circuitry 37, in this case the AND gate, is implemented in dedicated hardware circuitry in the execution unit of the execution stage 18, using any suitable combination of electronic components for forming the functionality of a Boolean AND. Dedicated circuitry or hardware means circuitry having a hard-wired function, as opposed to being programmed in software using general purpose code.
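The latch-at-0 behaviour of the AND-gate aggregation, together with the supervisor's GET and PUT accesses, can be captured in a few lines of software. This is a functional analogue of the circuit of FIG. 5, not the circuit itself; the class and method names are assumptions.

```python
# Software analogue of the aggregation circuit of FIG. 5: AND gate 37
# combines each EXIT's exit state with the running value in $LC 38.
class LocalConsensus:
    def __init__(self):
        self.lc = 1                  # $LC reset to 1 at a synchronization point

    def on_exit(self, exit_state):
        """AND gate 37: Q = A AND B, written back to $LC as the new aggregate."""
        self.lc &= exit_state
        return self.lc

    def get(self):                   # supervisor GET of $LC
        return self.lc

    def put(self, value=1):          # supervisor PUT to reset $LC
        self.lc = value

reg = LocalConsensus()
for state in (1, 1, 0, 1):           # the third worker exits false
    reg.on_exit(state)
print(reg.get())                     # 0 — latched at 0 until the next reset
reg.put(1)
print(reg.get())                     # 1 again after the supervisor's PUT
```

Note how the single 0 latches the aggregate: later exit states of 1 cannot raise it again, exactly as described for the hardware register 38.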
The update of the local exit state is triggered by execution of the special instruction EXIT, this being one of the fundamental machine code instructions in the instruction set of the processor module 4, having the inherent functionality of aggregating exit states. Also, the local aggregate is stored in a control register 38, i.e. a dedicated piece of storage (in some embodiments a single bit of storage) whose value can be accessed by the code running on the pipeline, but which is not usable by the load-store unit (LSU) to store any general purpose data. Instead, the function of data held in a control register is fixed, in this case to the function of storing the locally aggregated exit state. Preferably the local consensus register ($LC) 38 forms one of the control registers on the processor module 4 (e.g. on a tile), whose value can be accessed by the supervisor by executing a GET instruction and set by executing a PUT instruction. Note that the circuit shown in FIG. 5 is just one example. An equivalent circuit would be to replace the AND gate 37 with an OR gate and to invert the interpretation of the exit states 0 and 1 in software, i.e. 0 → true, 1 → false (with the register 38 being reset to 0 instead of 1 at each synchronization point). Equivalently, if the AND gate is replaced with an OR gate but the interpretation of the exit states is not inverted, nor the reset value, then the aggregated state in $LC will record whether any (rather than all) of the worker threads exited with state 1. In other embodiments, the exit states need not be single bits. E.g. the exit state of each individual worker thread may be a single bit, but the aggregated exit state $LC may comprise two bits representing a trinary state: all workers exited with state 1, all workers exited with state 0, or the workers' exit states were mixed.
As an example of the logic for implementing this, one of the two bits encoding the trinary value may be a Boolean AND (or OR) of the individual exit states, and the other bit of the trinary value may be a Boolean OR of the individual exit states. The third encoded case, indicating that the workers' exit states were mixed, can then be formed as the XOR of these two bits. The exit states can be used to represent whatever the programmer wishes, but one particularly envisaged example is to use an exit state of 1 to indicate that the respective worker thread has exited in a successful or true state, whilst an exit state of 0 indicates that the respective worker thread exited in an unsuccessful or false state (or vice versa if the aggregation circuitry 37 performs an OR instead of an AND and the register $LC 38 is reset initially to 0). For instance, consider an application where each worker thread performs a computation having an associated condition, such as a condition indicating whether the error or errors in the one or more parameters of a respective node in the graph of a machine intelligence algorithm has or have fallen within an acceptable level according to a predetermined metric. In this case, an individual exit state of one logic level (e.g. 1) may be used to indicate that the condition is satisfied (e.g. the error or errors in the one or more parameters of the node are within an acceptable level according to some metric); whilst an individual exit state of the opposite logic level (e.g. 0) may be used to indicate that the condition was not satisfied (e.g. the error or errors are not within an acceptable level according to the metric in question).
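The two-bit trinary encoding just proposed is easy to check in software: one bit is the AND of all individual exit states, the other their OR, and the "mixed" case falls out as the XOR of those two bits. The sketch below is illustrative only; the function name is an assumption.

```python
# Sketch of the two-bit trinary aggregate: (AND of all states, OR of all
# states), with "mixed" derived as the XOR of the two encoded bits.
def trinary_aggregate(exit_states):
    all_bit, any_bit = 1, 0
    for s in exit_states:
        all_bit &= s                 # Boolean AND of the individual exit states
        any_bit |= s                 # Boolean OR of the individual exit states
    mixed = all_bit ^ any_bit        # 1 only when the exit states were mixed
    return all_bit, any_bit, mixed

print(trinary_aggregate([1, 1, 1]))  # (1, 1, 0): all workers exited with state 1
print(trinary_aggregate([0, 0, 0]))  # (0, 0, 0): all workers exited with state 0
print(trinary_aggregate([1, 0, 1]))  # (0, 1, 1): the exit states were mixed
```

The XOR works because AND and OR of a set of bits differ exactly when the set contains both a 0 and a 1.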
The condition may for example be an error threshold placed on a single parameter or on each parameter, or could be a more complex function of a plurality of parameters associated with the respective computation performed by the worker thread. As another, more complex example, the individual exit states of the workers and the aggregated exit state may each comprise two or more bits, which may be used, for example, to represent a degree of confidence in the results of the worker threads. E.g. the exit state of each individual worker thread may represent a probabilistic measure of confidence in a result of the respective worker thread, and the aggregation logic 37 may be replaced with more complex circuitry for performing a probabilistic aggregation of the individual confidence levels in hardware. Whatever meaning is given by the programmer to the exit states, the supervisor thread SV can then get the aggregated value from the local consensus register ($LC) 38 to determine the aggregated exit state of all the worker threads that exited since it was last reset, for example at the last synchronization point, e.g. to determine whether or not all the workers exited in a successful or true state. In dependence on this aggregated value, the supervisor thread can then make a decision in accordance with the programmer's design choice. The programmer can choose to make whatever use of the locally aggregated exit state he or she wishes. For example, the supervisor thread can consult the local aggregated exit state in order to determine whether a certain portion of the program made up of a certain subset of worker threads has completed as expected or desired.
If not (e.g. at least one of the worker threads exited in an unsuccessful or false state), it may report to a host processor, or may perform another iteration of the part of the program comprising the same worker threads; but if so (e.g. all the worker threads exited in a successful or true state) it may instead branch to another part of the program comprising one or more new workers. Preferably the supervisor thread should not access the value in the local consensus register ($LC) 38 until all the worker threads in question have exited, such that the value stored therein represents the correct, up-to-date aggregate state of all the desired threads. This wait may be enforced by a barrier synchronization performed by the supervisor thread to wait for all the currently running local worker threads (i.e. those on the same processor module 4, running through the same pipeline 13) to exit. I.e. the supervisor thread resets the local consensus register ($LC) 38, launches a plurality of worker threads, and then initiates a local barrier synchronization (local to the processing module 4, local to one tile) in order to wait for all the outstanding worker threads to exit before the supervisor is allowed to proceed to get the aggregated exit state from the local consensus register ($LC) 38. Referring to FIG. 6, in embodiments a SYNC (synchronization) instruction is provided in the processor's instruction set. The SYNC instruction has the effect of causing the supervisor thread SV to wait until all the currently executing workers W have exited by means of an EXIT instruction.
In embodiments the SYNC instruction takes a mode as an operand (in embodiments its only operand), the mode specifying whether the SYNC is to act only locally in relation to only those worker threads running locally on the same processor module 4, e.g. the same tile, as the supervisor as part of which the SYNC is executed (i.e. only threads through the same pipeline 13 of the same barrel-threaded processing unit 10); or whether instead it is to apply across multiple tiles or even across multiple chips. SYNC mode // mode ∈ {tile, chip, zone_1, zone_2} This will be discussed in more detail later, but for the purposes of FIG. 6 a local SYNC will be assumed (SYNC tile, i.e. a synchronization within a single tile). The workers do not need to be identified as operands of the SYNC instruction, as it is implicit that the supervisor SV is then caused to automatically wait until none of the time slots S0, S1, ... of the barrel-threaded processing unit 10 is occupied by a worker. As shown in FIG. 6, once each of a current batch of workers WLn have all been launched by the supervisor, the supervisor then executes a SYNC instruction. If the supervisor SV launches workers W in all the slots S0 ... 3 of the barrel-threaded processing unit 10 (all four in the example illustrated, but that is just one example implementation), then the SYNC will be executed by the supervisor once the first of the current batch of worker threads WLn has exited, thus handing back control of at least one slot to the supervisor SV. Otherwise, if the workers do not take up all of the slots, the SYNC will simply be executed immediately after the last thread of the current batch WLn has been launched.
Either way, the SYNC causes the supervisor SV to wait for all the others of the current batch of workers WLn-1 to execute an EXIT before the supervisor can proceed. Only after this does the supervisor execute a GET instruction to get the content of the local consensus register ($LC) 38. This waiting by the supervisor thread is imposed in hardware once the SYNC has been executed. I.e., in response to the opcode of the SYNC instruction, the logic in the execution unit (EXU) of the execution stage 18 causes the fetch stage 14 and scheduler 24 to pause from issuing instructions of the supervisor thread until all the outstanding worker threads have executed an EXIT instruction. At some point after getting the value of the local consensus register ($LC) 38 (optionally with some other supervisor code in between), the supervisor executes a PUT instruction to reset the local consensus register ($LC) 38 (to 1 in the illustrated example). As also illustrated in FIG. 6, the SYNC instruction may also be used to place synchronization barriers between different interdependent layers WL1, WL2, WL3, ... of worker threads, where one or more threads in each successive layer is dependent on data output by one or more worker threads in its preceding layer. The local SYNC executed by the supervisor thread ensures that none of the worker threads in the next layer WLn+1 executes until all the worker threads in the immediately preceding layer WLn have exited (by executing an EXIT instruction). As mentioned, in embodiments the processor module 4 may be implemented as one of an array of interconnected tiles forming a multi-tile processor, each of the tiles being arranged as described above in relation to FIGS. 1 to 6. This is illustrated in FIG.
7, which shows a single-chip processor 2, i.e. a single die, comprising an array 6 of multiple processor tiles 4 and an on-chip interconnect 34 connecting the tiles 4. The chip 2 may be implemented alone in its own single-chip integrated circuit package, or as one of multiple dies packaged in the same IC package. The on-chip interconnect may also be referred to herein as the exchange fabric 34, since it enables the tiles 4 to exchange data with one another. Each tile 4 comprises a respective instance of the barrel-threaded processing unit 10 and a memory 11, each arranged as described above in relation to FIGS. 1 to 6. For instance, by way of illustration, the chip 2 may comprise of the order of a hundred tiles 4, or even over a thousand. For completeness, note also that an array as referred to herein does not necessarily imply any particular number of dimensions or any particular physical layout of the tiles 4. In embodiments each chip 2 also comprises one or more external links 8, enabling the chip 2 to be connected to one or more other, external processors on different chips (e.g. one or more other instances of the same chip 2). These external links 8 may comprise any one or more of: one or more chip-to-host links for connecting the chip 2 to a host processor, and/or one or more chip-to-chip links for connecting with one or more other instances of the chip 2 on the same IC package or card, or on different cards. In one example arrangement, the chip 2 receives work from a host processor (not shown), which is connected to the chip via one of the chip-to-host links, in the form of input data to be processed by the chip 2. Multiple instances of the chip 2 can be connected together into cards by chip-to-chip links. 
Thus a host may access a computer whose architecture comprises a single-chip processor 2, or multiple single-chip processors 2 possibly arranged on multiple interconnected cards, depending on the workload required by the host application. The interconnect 34 is arranged to enable the different processor tiles 4 in the array 6 to communicate with one another on the chip 2. However, just as there may potentially be dependencies between threads on the same tile 4, there may also be dependencies between the portions of the program running on different tiles 4 in the array 6. A technique is therefore required to prevent a piece of code on one tile 4 from running ahead of data on which it depends that is made available by another piece of code on another tile 4. In certain embodiments, this is achieved by implementing a bulk synchronous parallel (BSP) exchange scheme, as illustrated schematically in FIGS. 8 and 9. According to one version of BSP, each tile 4 performs a compute phase 52 and an exchange phase 50 in an alternating cycle, separated from one another by a barrier synchronization 30 between the tiles. In the illustrated case a barrier synchronization is placed between each compute phase 52 and the following exchange phase 50. During the compute phase 52 each tile 4 performs one or more computation tasks locally on the tile, but does not communicate any results of these computations to any other of the tiles 4. In the exchange phase 50 each tile 4 is allowed to exchange one or more results of the computations from the preceding compute phase with one or more of the other tiles in the group, but does not perform any new computation until it has received from the other tiles 4 any data on which its task or tasks depend. Nor does it send any data to the other tiles, other than that computed in the preceding compute phase. 
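The alternation of compute and exchange phases separated by a barrier can be sketched in software. The following is a minimal illustrative model, not the hardware mechanism: each "tile" is a thread that does local work, waits at a barrier (modelling barrier synchronization 30), and only then reads the other tiles' results.

```python
# Minimal software sketch of the BSP cycle: compute phase 52, then barrier
# sync 30, then exchange phase 50. Names and data are illustrative.
import threading

NUM_TILES = 4
barrier = threading.Barrier(NUM_TILES)   # models barrier synchronization 30
results = {}                             # data made visible for exchange
log = []

def tile(tid):
    local = tid * tid                    # compute phase: local work only
    results[tid] = local
    barrier.wait()                       # no tile exchanges until all computed
    peers = sorted(results.values())     # exchange phase: read others' results
    log.append((tid, peers))

threads = [threading.Thread(target=tile, args=(t,)) for t in range(NUM_TILES)]
for t in threads: t.start()
for t in threads: t.join()

# Because of the barrier, every tile observed the complete set of results.
assert all(peers == [0, 1, 4, 9] for _, peers in log)
```

The barrier guarantees the ordering property the text describes: no tile reads exchange data produced in a compute phase that some other tile has not yet finished.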
It is not excluded that other operations, such as internal control-related operations, may be performed in the exchange phase. In certain embodiments the exchange phase 50 does not include any non-time-deterministic computations, though a small number of time-deterministic computations may optionally be allowed during the exchange phase 50. Note also that a tile 4 performing computation may be allowed during the compute phase 52 to communicate with other system resources external to the array of tiles 4 being synchronized - for example a network card, a disk drive, or a field-programmable gate array (FPGA) - as long as this does not involve communication with the other tiles 4 within the group being synchronized. Communication external to the tile group may optionally use the BSP mechanism, but alternatively may not use BSP and may instead use some other synchronization mechanism of its own. According to the BSP principle, a barrier synchronization 30 is placed at the juncture transitioning from the compute phases 52 into the exchange phase 50, or at the juncture transitioning from the exchange phases 50 into the compute phase 52, or both. That is to say, either: (a) all the tiles 4 must complete their respective compute phases 52 before any tile in the group is allowed to proceed to the next exchange phase 50, or (b) all the tiles 4 in the group must complete their respective exchange phases 50 before any tile in the group is allowed to proceed to the next compute phase 52, or (c) both of these conditions are imposed. In all three variants it is the individual processors which alternate between phases, and the whole assembly which synchronizes. The sequence of exchange and compute phases may then repeat multiple times. 
In BSP terminology, each repetition of exchange phase and compute phase is sometimes referred to as a "superstep" (note though that the terminology is not always used consistently in the literature: sometimes each individual exchange phase and each individual compute phase is called a superstep, whereas elsewhere, as in the terminology adopted herein, the exchange and compute phases together are referred to as a superstep). Note also that it is not excluded that multiple independent groups of tiles 4 on the same chip 2 or on different chips could each form a separate respective BSP group operating asynchronously with respect to one another, with the BSP cycle of compute, synchronize and exchange being imposed only within each given group, each group doing so independently of the other groups. That is, a multi-tile array 6 may include multiple internally synchronous groups each operating independently and asynchronously of the other such groups (discussed in more detail below). In some embodiments there is a hierarchical grouping of synchronization and exchange, as will be discussed in more detail below. FIG. 9 illustrates the BSP principle as implemented amongst a group 4i, 4ii, 4iii of some or all of the tiles in the array 6, in the case which imposes: (a) a barrier synchronization between compute phase 52 and exchange phase 50 (see above). Note that in this arrangement, some tiles 4 are allowed to begin computing 52 whilst some others are still exchanging. According to the embodiments disclosed herein, this type of BSP may be facilitated by incorporating additional, special, dedicated functionality into a machine code instruction for performing barrier synchronization, i.e. the SYNC instruction. In certain embodiments, the SYNC function takes on this functionality when qualified by an inter-tile mode as operand, e.g. the on-chip mode: SYNC chip. 
This is illustrated schematically in FIG. 10. In the case where each tile 4 comprises a multi-threaded processing unit 10, each tile's compute phase 52 may in fact comprise tasks performed by multiple worker threads W on the same tile 4 (and a given compute phase 52 on a given tile 4 may comprise one or more layers WL of worker threads, which in the case of multiple layers may be separated by internal barrier synchronizations using the SYNC instruction with the local, on-tile mode as operand, as described previously). Once the supervisor thread SV on a given tile 4 has launched the last worker thread in the current BSP superstep, the supervisor on that tile 4 then executes a SYNC instruction with the inter-tile mode set as the operand: SYNC chip. If the supervisor is to launch (RUN) worker threads in all the slots of its respective processing unit 10, the SYNC chip is executed as soon as the first slot that is no longer needed to RUN any more worker threads in the current BSP superstep is handed back to the supervisor. For example, this may occur after the first thread to EXIT in the last layer WL, or simply after the first worker thread to EXIT if there is only a single layer. Otherwise, if not all the slots are to be used for running worker threads in the current BSP superstep, the SYNC chip can be executed as soon as the last worker thread that needs to be RUN in the current BSP superstep has been launched. This may occur once all the worker threads in the last layer have been RUN, or simply once all the worker threads have been RUN if there is only a single layer. The execution unit (EXU) of the execution stage 18 is arranged so as, in response to the opcode of the SYNC instruction, when qualified by the on-chip (inter-tile) operand, to cause the supervisor thread in which the SYNC chip was executed to be paused until all the tiles 4 in the array 6 have finished running their workers. 
This can be used to implement a barrier to the next BSP superstep, i.e. after all the tiles 4 on the chip 2 have passed the barrier, the cross-tile program as a whole can progress to the next exchange phase 50. FIG. 11 is a schematic diagram illustrating the logic triggered by a SYNC chip according to embodiments disclosed herein. [0110] Once the supervisor has launched (RUN) all of the threads it intends to launch in the current compute cycle 52, it executes a SYNC instruction with the on-chip, inter-tile operand: SYNC chip. This triggers the following functionality in dedicated synchronization logic 39 on the tile 4, and in a synchronization controller 36 implemented in the hardware interconnect 34. This functionality of both the on-tile sync logic 39 and the sync controller 36 in the interconnect 34 is implemented in dedicated hardware circuitry such that, once the SYNC chip is executed, the rest of the functionality proceeds without further instructions being executed to do so. Firstly, the on-tile sync logic 39 causes the instruction issue for the supervisor on the tile 4 in question to automatically pause (it causes the fetch stage 14 and the scheduler 24 to suspend issuing instructions of the supervisor). Once all the outstanding worker threads on the local tile 4 have performed an EXIT, the sync logic 39 automatically sends a synchronization request sync_req to the synchronization controller 36 in the interconnect 34. The local tile 4 then continues to wait with the supervisor instruction issue paused. A similar process is also implemented on each of the other tiles 4 in the array 6 (each comprising its own instance of the sync logic 39). 
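The sync_req/sync_ack handshake between the per-tile sync logic 39 and the sync controller 36 can be sketched as follows. This is a hedged software model; the class and method names are invented for illustration, and the real mechanism is dedicated hardware circuitry.

```python
# Illustrative model of sync controller 36: it releases the sync_ack
# only once every tile in the array has raised sync_req.

class SyncController:
    def __init__(self, num_tiles):
        self.num_tiles = num_tiles
        self.requests = set()

    def sync_req(self, tile_id):
        """A tile's sync logic raises sync_req after all its workers EXIT.
        Returns True when sync_ack is broadcast (all tiles have requested)."""
        self.requests.add(tile_id)
        return len(self.requests) == self.num_tiles

ctrl = SyncController(num_tiles=3)
assert ctrl.sync_req(0) is False   # tile 0 paused, awaiting sync_ack
assert ctrl.sync_req(1) is False   # tile 1 likewise
assert ctrl.sync_req(2) is True    # last tile in: sync_ack released to all
```

In the described embodiments the tiles do not poll a return value; their supervisor instruction issue simply stays paused until the dedicated sync wires deliver sync_ack.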
Thus at some point, once all the final worker threads in the current compute phase 52 have EXITed on all the tiles 4 in the array 6, the synchronization controller 36 will have received a respective synchronization request (sync_req) from all the tiles 4 of the array 6. Only then, in response to receiving the sync_req from every tile 4 of the array 6 on the same chip 2, does the synchronization controller 36 send a synchronization acknowledgment signal sync_ack back to the sync logic 39 on each of the tiles 4. Up until this point, each of the tiles 4 has had its supervisor instruction issue paused awaiting the synchronization acknowledgment signal (sync_ack). Upon receiving the sync_ack signal, the sync logic 39 in the tile 4 automatically unpauses the supervisor instruction issue for the respective supervisor thread on that tile 4. The supervisor is then free to proceed with exchanging data with other tiles 4 via the interconnect 34 in a subsequent exchange phase 50. Preferably the sync_req and sync_ack signals are transmitted and received to and from the synchronization controller, respectively, via one or more dedicated sync wires connecting each tile 4 to the synchronization controller 36 in the interconnect 34. Furthermore, according to embodiments disclosed herein, additional functionality is included in the SYNC instruction. That is, at least when executed in an inter-tile mode (e.g. SYNC chip), the SYNC instruction also causes the local exit states $LC of each of the synchronized tiles 4 to be automatically aggregated in further dedicated hardware 40 in the interconnect 34. In the embodiments shown this logic takes the form of a multi-input AND gate (one input for each tile 4 in the array 6), e.g. formed from a chain of two-input AND gates 40i, 40ii, ... as shown by way of example in FIG. 11. 
This inter-tile aggregation logic 40 receives the value in the local exit state register (local consensus register) $LC 38 from each tile 4 in the array - in embodiments each a single bit - and aggregates them into a single value, e.g. an AND of all the locally aggregated exit states. Thus the logic forms a globally aggregated exit state across all the threads on all the tiles 4 in the array 6. Each of the tiles 4 comprises a respective instance of a global consensus register ($GC) 42 arranged to receive and store the global exit state from the global aggregation logic 40 in the interconnect 34. In some embodiments this is another of the status registers in the supervisor's context register bank CXS. In response to the synchronization request (sync_req) being received from all of the tiles 4 in the array 6, the synchronization controller 36 causes the output of the aggregation logic 40 (e.g. the output of the AND) to be stored in the global consensus register ($GC) 42 on every tile 4 (it will be appreciated that the switch shown in FIG. 11 is a schematic representation of the functionality and that in fact the update may be implemented by any suitable digital logic). This register $GC 42 is accessible by the supervisor thread SV on the respective tile 4 once the supervisor instruction issue is resumed. In some embodiments the global consensus register $GC is implemented as a control register in the control register bank, such that the supervisor thread can obtain the value in the global consensus register ($GC) 42 by means of a GET instruction. 
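The AND-chain aggregation and the broadcast into every tile's $GC can be sketched as follows, under the assumption of single-bit exit states. Function names are illustrative; the real logic is a hardware gate chain, not software.

```python
# Sketch of aggregation logic 40: a chain of two-input AND gates
# (40i, 40ii, ...) reduces each tile's single-bit $LC into one global exit
# state, which is then stored in every tile's $GC register 42.
from functools import reduce

def and_gate(a, b):
    return a & b                          # one two-input AND gate

def aggregate_exit_states(lc_values):
    """Fold the tiles' $LC bits through the AND-gate chain."""
    return reduce(and_gate, lc_values)

def broadcast_gc(tiles_gc, lc_values):
    gc = aggregate_exit_states(lc_values)
    for i in range(len(tiles_gc)):        # written into $GC 42 on every tile
        tiles_gc[i] = gc
    return tiles_gc

assert aggregate_exit_states([1, 1, 1, 1]) == 1  # all tiles succeeded
assert aggregate_exit_states([1, 0, 1, 1]) == 0  # any failure -> global 0
assert broadcast_gc([None] * 4, [1, 1, 1, 1]) == [1, 1, 1, 1]
```

Replacing `and_gate` with an OR (and optionally inverting the interpretation of 0 and 1) gives the equivalent variants discussed below.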
Note that the sync logic 36 waits until the sync_req is received from all the tiles 4 before updating the value in any of the global consensus registers ($GC) 42; otherwise an incorrect value could be made accessible to a supervisor thread on a tile that has not yet completed its part of the compute phase 52 and is therefore still running. The globally aggregated exit state $GC enables the program to determine an overall outcome of parts of the program running on multiple different tiles 4 without having to individually examine the state of each individual worker thread on each individual tile. It can be used for any purpose desired by the programmer. For instance, in the example shown in FIG. 11 where the global aggregate is a Boolean AND, any input at 0 results in an aggregate of 0, but if all the inputs are 1 then the aggregate is 1. That is, if a 1 is used to represent a true or successful outcome, this means that if any of the local exit states of any of the tiles 4 is false or unsuccessful, then the globally aggregated state will also be false, or will represent an unsuccessful outcome. For example, this could be used to determine whether the parts of the code running on all the tiles have all satisfied a predetermined condition. Thus, the program can query a single register (in some embodiments a single bit) to ask "did anything go wrong, yes or no?" or "have all the nodes in the graph reached an acceptable error level, yes or no?", rather than having to examine the individual states of the individual worker threads on each individual tile (and again, in embodiments the supervisor is in fact not able to query the state of the workers except via the exit state registers 38, 42). In other words, the EXIT and SYNC instructions each reduce multiple individual exit states into a single combined state. 
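The first of these two reductions, each worker's EXIT folding its state into the tile's local consensus register $LC, together with the supervisor's GET and PUT handling of $LC described earlier, can be sketched as follows. All names are hypothetical; this is a behavioural model only.

```python
# Illustrative model of $LC 38: worker EXIT states are ANDed in, the
# supervisor reads the result with GET after SYNC, and resets it with PUT.

class Tile:
    def __init__(self):
        self.lc = 1                      # $LC reset value (1 in the example)

    def worker_exit(self, exit_state):
        self.lc &= exit_state            # EXIT ANDs each worker's state in

    def get_lc(self):
        return self.lc                   # GET $LC

    def put_lc(self, value=1):
        self.lc = value                  # PUT: reset $LC for the next batch

tile = Tile()
for state in (1, 1, 0, 1):               # one worker reported failure (0)
    tile.worker_exit(state)
assert tile.get_lc() == 0                 # local consensus: not all succeeded
tile.put_lc()                             # reset before the next layer WLn+1
assert tile.get_lc() == 1
```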
In one example use case, the supervisor on one or more of the tiles may report to a host processor if the global aggregate indicated a false or unsuccessful outcome. As another example, the program may perform a branch decision depending on the global exit state. For example, the program examines the globally aggregated exit state $GC and based on this determines whether to continue looping or to branch elsewhere. As long as the global exit state $GC remains false or unsuccessful, the program keeps iterating the same, first part of the program, but once the global exit state $GC is true or successful, the program branches to a second, different part of the program. The branch decision may be implemented individually in each supervisor thread, or by one of the supervisors taking on the role of master and instructing the other, slave supervisors on the other tiles (the master role being configured in software). Note that the aggregation logic 40 shown in FIG. 11 is just one example. In another, equivalent example, the AND may be replaced with an OR and the interpretation of 0 and 1 inverted (0 → true, 1 → false). Equivalently, if the AND gate is replaced with an OR gate but the interpretation of the exit states is not inverted, and neither is the reset value, then the aggregated state in $GC will record whether any (rather than all) of the tiles exited with locally aggregated state 1. In another example, the global exit state $GC may comprise two bits representing a trinary state: all the tiles' locally aggregated exit states $LC had state 1, all the tiles' locally aggregated exit states $LC had state 0, or the tiles' locally aggregated exit states $LC were mixed. 
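The loop-until-converged branch decision described above can be sketched as follows. The convergence criterion, halving an error value until it drops below a threshold, is invented purely for the example; only the control structure (iterate while $GC is false, branch once it is true) reflects the text.

```python
# Illustrative sketch: keep iterating the first part of the program while
# the global exit state $GC is false; branch out once it reads true.

def superstep(values):
    """One iteration: each 'tile' halves its error and reports $LC = 1
    if it has converged; $GC is the global AND aggregate."""
    values = [v / 2 for v in values]
    lcs = [1 if v < 0.1 else 0 for v in values]   # per-tile exit states
    gc = all(lcs)                                 # global AND aggregate
    return values, gc

values = [1.0, 0.8, 1.2]
iterations = 0
while True:
    values, gc = superstep(values)
    iterations += 1
    if gc:            # $GC true: branch out of the loop to the next stage
        break
assert iterations == 4    # slowest tile: 1.2 -> 0.6 -> 0.3 -> 0.15 -> 0.075
```

As the text notes, this decision could be taken independently by every supervisor, or by a single software-designated master supervisor instructing the others.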
In another, more complex example, the local exit states of the tiles 4 and the globally aggregated exit state may each comprise two or more bits, which may be used, for example, to represent a degree of confidence in the results of the tiles 4. For instance, the locally aggregated exit state $LC of each individual tile may represent a statistical, probabilistic measure of confidence in a result of the respective tile 4, and the global aggregation logic 40 may be replaced with more complex circuitry for performing a statistical aggregation of the individual confidence levels in hardware. As mentioned previously, in some embodiments multiple instances of the chip 2 can be connected together to form an even larger array of tiles 4 spanning multiple chips 2. This is illustrated in FIG. 12. Some or all of the chips 2 may be implemented on the same IC package, or some or all of the chips 2 may be implemented on different IC packages. The chips 2 are connected together by an external interconnect 72 (via the external links 8 shown in FIG. 7). As well as providing a conduit for exchanging data between tiles 4 on different chips, the external interconnect 72 also provides hardware support for performing barrier synchronization between the tiles 4 on different chips 2 and for aggregating the local exit states of the tiles 4 on the different chips 2. In certain embodiments, the SYNC instruction can take at least one further possible value of its mode operand to specify an external, i.e. inter-chip, synchronization: SYNC zone_n, where zone_n represents an external synchronization zone. The external interconnect 72 comprises hardware logic similar to that described in relation to FIG. 11, but on an external, inter-chip scale. 
When the SYNC instruction is executed with an external synchronization zone of two or more chips specified in its operand, this causes the logic in the external interconnect to operate in a similar manner to that described in relation to the internal interconnect 34, but across the tiles on the multiple different chips 2 in the specified synchronization zone. That is, in response to an external SYNC, the supervisor instruction issue is paused until all the tiles 4 on all the chips 2 in the external synchronization zone have completed their compute phase 52 and submitted a synchronization request. Further, the logic in the external interconnect 72 aggregates the local exit states of all these tiles 4, across the multiple chips 2 in the zone in question. Once all the tiles 4 in the external synchronization zone have made the synchronization request, the external interconnect 72 signals a synchronization acknowledgment back to the tiles 4 and stores the cross-chip globally aggregated exit state into the global consensus registers ($GC) 42 of all the tiles 4 in question. In response to the synchronization acknowledgment, the tiles 4 on all the chips 2 in the zone resume the supervisor instruction issue. [0121] In embodiments, the functionality of the interconnect 72 may be implemented in the chips 2, i.e. the logic may be distributed among the chips 2 such that only wired connections between chips are required (FIGS. 11 and 12 are schematic). All the tiles 4 within the mentioned synchronization zone are programmed to indicate the same synchronization zone via the mode operand of their respective SYNC instructions. 
In embodiments, the sync logic in the external interconnect 72 is arranged such that, if that is not the case due to a programming error or some other error (such as a memory parity error), then some or all of the tiles 4 will not receive an acknowledgment, and therefore the system will come to a halt at the next external barrier, thus allowing a managing external CPU (e.g. the host) to intervene for debug or system recovery. In other embodiments an error is raised in the case where the synchronization zones do not match. Preferably however, the compiler is arranged to ensure that the tiles in the same zone all indicate the same, correct synchronization zone at the relevant time. FIG. 13 illustrates an example BSP program flow involving both internal (on-chip) and external (inter-chip) synchronization. As shown, it is preferable to keep the internal exchanges 50 (of data between tiles 4 on the same chip 2) separate from the external exchanges 50' (of data between tiles 4 on different chips 2). One reason for this is that a global exchange between multiple chips, gated by a global synchronization, can be more costly in terms of latency and load-balancing complexity than in the case of only chip-level synchronization and exchange. Another possible reason is that data exchange via the internal (on-chip) interconnect 34 can be made time-deterministic, whereas in embodiments data exchange via the external interconnect 72 may be non-time-deterministic. In such scenarios it can be useful to separate internal and external exchanges so that the external synchronization and exchange process does not "contaminate" the internal synchronization and exchange. 
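One way to express this separation of internal and external exchange is as a fixed per-iteration ordering of phases enforced by the program. The following sketch is illustrative only; the phase names are labels, and the number of internal rounds per external round is a free parameter of the example.

```python
# Sketch of a compiler-enforced phase ordering: internal compute, sync and
# exchange (possibly several rounds) strictly before the external barrier
# and external exchange.

def bsp_iteration(inner_rounds=1):
    trace = []
    for _ in range(inner_rounds):      # internal rounds may repeat first
        trace += ["compute",           # compute phase 52
                  "internal_sync",     # internal barrier sync 30
                  "internal_exchange"] # internal exchange phase 50
    trace += ["external_sync",         # external barrier sync 80
              "external_exchange"]     # external exchange phase 50'
    return trace

trace = bsp_iteration(inner_rounds=2)
assert trace.count("internal_exchange") == 2
# the external barrier comes only after every internal exchange
assert trace.index("external_sync") > trace.index("internal_exchange")
```

This ordering is imposed purely by the emitted program sequence, matching the statement that the compiler generates the global sequence as such.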
Accordingly, to achieve such a separation, in embodiments the program is arranged to perform a sequence of synchronizations, exchange phases and compute phases comprising, in the following order: (i) a first compute phase, then (ii) an internal barrier synchronization 30, then (iii) an internal exchange phase 50, then (iv) an external barrier synchronization 80, then (v) an external exchange phase 50'. See chip 2II in FIG. 13. The external barrier 80 is imposed after the internal exchange phase 50, such that the program only proceeds to the external exchange 50' after the internal exchange 50. Note also that, as shown with respect to chip 2I in FIG. 13, optionally a compute phase may be included between internal exchange (iii) and external barrier (iv). The overall sequence is enforced by the program (e.g. by being generated as such by the compiler), and the internal synchronization and exchange do not extend to any tiles or other entities on another chip 2. The sequence (i)-(v) (with the aforementioned optional compute phase between iii and iv) may be repeated in a series of overall iterations. Per iteration there may be multiple instances of the internal compute, synchronize and exchange (i)-(iii) prior to the external synchronization and exchange. [0125] Note that during an external exchange 50' the communications are not limited to being only external: some tiles may perform only internal exchanges, some may perform only external exchanges, and some may perform a mixture of the two. Note also that, as shown in FIG. 13, it is in general possible to have a null compute phase 52 or a null exchange phase 50 in any given BSP superstep. In certain embodiments, as also shown in FIG. 13, some tiles 4 may perform local input/output during a compute phase, for example they may exchange data with a host. As illustrated in FIG. 
14, in embodiments the mode of the SYNC instruction can be used to specify one of multiple different possible external synchronization zones, e.g. zone_1 or zone_2. In embodiments these correspond to different hierarchical levels. That is, each higher hierarchical level 92 (e.g. zone 2) encompasses two or more zones 91A, 91B of at least one lower hierarchical level. In some embodiments there are just two hierarchical levels, but higher numbers of nested levels are not excluded. If the operand of the SYNC instruction is set to the lower hierarchical level of external synchronization zone (SYNC zone_1), then the above-described synchronization and aggregation operations are performed in relation to the tiles 4 on the chips 2 in only the same lower-level external synchronization zone as the tile on which the SYNC was executed. If, on the other hand, the operand of the SYNC instruction is set to the higher hierarchical level of external synchronization zone (SYNC zone_2), then the above-described synchronization and aggregation operations are automatically performed in relation to all the tiles on all the chips 2 in the same higher-level external synchronization zone as the tile on which the SYNC was executed. In some embodiments the highest hierarchical level of synchronization zone encompasses all of the chips, i.e. it is used to perform a global synchronization. When multiple lower-level zones are used, BSP may be imposed internally amongst the group of tiles 4 on the chip(s) 2 within each zone, but each zone may operate asynchronously with respect to the others until a global synchronization is performed. Note that in other embodiments, the synchronization zones specifiable by the mode of the SYNC instruction are not limited to being hierarchical in nature. In general, a SYNC instruction may be provided with modes corresponding to any kind of grouping. 
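The hierarchical zone selection just described can be sketched as a lookup of which chips participate in a given SYNC. The zone layout and all names below are invented for illustration.

```python
# Sketch of hierarchical sync zones: the higher level (92) nests two
# lower-level zones (91A, 91B); the SYNC mode selects the participant set.

ZONES = {
    "zone_1A": {"chip0", "chip1"},                    # lower level, 91A
    "zone_1B": {"chip2", "chip3"},                    # lower level, 91B
    "zone_2":  {"chip0", "chip1", "chip2", "chip3"},  # level 92 = 91A + 91B
}

def participants(mode, executing_chip):
    """Chips synchronized by 'SYNC <mode>' executed on executing_chip."""
    zone = ZONES[mode]
    assert executing_chip in zone   # a tile only syncs a zone it belongs to
    return zone

# A lower-level SYNC on chip0 involves only 91A; zone_2 involves all chips.
assert participants("zone_1A", "chip0") == {"chip0", "chip1"}
assert participants("zone_2", "chip0") == ZONES["zone_2"]
```

Under this model, zone_1A and zone_1B can each run their own BSP cycles asynchronously until some tile issues a SYNC zone_2, which draws in every chip.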
For instance, the modes may enable selection from amongst only non-hierarchical groups, or a mixture of hierarchical groupings and one or more non-hierarchical groups (where at least one group is not entirely nested within another). This advantageously enables flexibility for the programmer or the compiler, with minimal code density, to select between different layouts of internally synchronous groups which are asynchronous with respect to one another. An example mechanism for implementing the synchronization amongst the selected synchronization groups 91, 92 is illustrated in FIG. 16. As illustrated, the external sync logic 76 in the external interconnect 72 comprises a respective sync block 95 associated with each respective chip 2. Each sync block 95 comprises respective gating logic and a respective sync aggregator. The gating logic comprises hardware circuitry which connects the chips 2 together in a daisy-chain topology for the purposes of synchronization and exit-state aggregation, and which propagates the synchronization and exit-state information as follows. The sync aggregator comprises hardware circuitry arranged to aggregate the synchronization requests (sync_req) and the exit states as follows. The respective sync block 95 associated with each chip 2 is connected to its respective chip 2, such that it can detect the synchronization request (Sync_req) raised by that chip 2 and the exit state of that chip 2, and such that it can return the synchronization acknowledgment (Sync_ack) and the global exit state to the respective chip 2. The respective sync block 95 associated with each chip 2 is also connected to the sync block 95 of at least one other of the chips 2 via an external sync interface comprising a bundle of four sync wires 96, the details of which will be discussed in more detail shortly. 
This may be part of one of the chip-to-chip links 8. In the case of a link between chips 2 on different cards, the interface 8 may for example comprise a PCI interface and the four sync wires 96 may be implemented by reusing four wires of the PCI interface. Some of the chips' sync blocks 95 are connected to those of two adjacent chips 2, each connection being made via a respective instance of the four sync wires 96. This way, the chips 2 can be connected in one or more daisy chains via their sync blocks 95. This enables the sync requests, sync acknowledgments, running aggregates of exit states, and global exit states to be propagated up and down the chain. In operation, for each synchronization group 91, 92, the sync block 95 associated with one of the chips 2 in that group is set as the master for the purposes of synchronization and exit-state aggregation, the rest being slaves for this purpose. Each of the slave sync blocks 95 is configured with the direction (e.g. left or right) in which it needs to propagate the sync requests, sync acknowledgments and exit states for each synchronization group 91, 92 (i.e. the direction toward the master). In certain embodiments these settings are configurable in software, e.g. in an initial configuration phase after which the configuration remains set throughout the subsequent operation of the system. For instance, this may be configured by the host processor. Alternatively, it is not excluded that the configuration could be hard-wired. Either way, the different synchronization groups 91, 92 can have different masters, and in general it is possible for a given chip 2 (or rather its sync block 95) to be master of one group and not of another group of which it is a member, or to be master of multiple groups. 
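The daisy-chain propagation toward the master can be sketched as follows. This is a simplified software model under the assumption of a single linear chain of single-bit chip-level exit states; the list-based representation and function name are illustrative only.

```python
# Model of daisy-chain sync: sync_req and a running AND-aggregate of chip
# exit states propagate chip-by-chip toward the master sync block, which
# returns the global aggregate with the sync_ack to every chip.

def chain_sync(chain_exit_states):
    """chain_exit_states: per-chip exit states ordered from the far end of
    the chain toward the master (last element = the master's own chip)."""
    running = 1
    for state in chain_exit_states[:-1]:   # slaves forward sync_req plus
        running &= state                   # the running aggregate onward
    global_state = running & chain_exit_states[-1]  # master adds its own
    # the master broadcasts sync_ack and the global state back down
    return [("sync_ack", global_state)] * len(chain_exit_states)

acks = chain_sync([1, 1, 0, 1])            # the third chip reported failure
assert all(ack == ("sync_ack", 0) for ack in acks)
acks = chain_sync([1, 1, 1, 1])
assert all(ack == ("sync_ack", 1) for ack in acks)
```

A master part-way along a chain, or one with several chain branches, would combine a running aggregate arriving from each direction before issuing the acknowledgment, as described in the text below.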
By way of illustration, consider the example scenario of FIG. 16. Say, for example, that the synchronization block 95 of chip 2IV is set as master of a given synchronization group 91A. Now consider the first chip 2I in the chain of chips 2, connected via their synchronization blocks 95 and wires 96 ultimately to chip 2IV. When all the worker threads of the current calculation phase on the first chip 2I have executed an EXIT instruction, and the supervisors on all the (participating) tiles 4 have all executed a SYNC instruction specifying the synchronization group 91A, then the first chip 2I signals its readiness for synchronization to its respective associated synchronization block 95. The chip 2I also outputs its chip-level aggregated output state (the aggregate of all the exiting worker threads on all the participating tiles on the respective chip 2I). In response, the synchronization block 95 of the first chip 2I propagates a synchronization request (sync_req) to the synchronization block 95 of the next chip 2II in the chain. It also propagates the output state of the first chip 2I to the synchronization block 95 of that next chip 2II. The synchronization block 95 of this second chip 2II waits until the supervisors of its own (participating) tiles 4 have all executed a SYNC instruction specifying the synchronization group 91A, causing the second chip 2II to signal its readiness for synchronization. Only then does the second chip's synchronization block 95 propagate a synchronization request to the synchronization block 95 of the next (third) chip 2III in the chain, and also propagate a running aggregate of the output state of the first chip 2I with that of the second chip 2II. If the second chip 2II had become ready for synchronization before the first chip 2I, then the synchronization block 95 of the second chip 2II would have waited for the first chip 2I to signal a synchronization request before propagating the synchronization request to the synchronization block 95 of the third chip 2III.
The synchronization block 95 of the third chip 2III behaves in a similar manner, this time aggregating the running aggregate output state from the second chip 2II to obtain the next running aggregate to pass onward, and so on. This continues toward the master synchronization block, that of chip 2IV in this example. The master's synchronization block 95 then determines a global aggregate of all the output states on the basis of the running aggregate it receives and the output state of its own chip 2IV. It propagates this global aggregate back along the chain to all the chips 2, together with the synchronization acknowledgment (sync_ack). If the master is partway along a chain, as opposed to being at one end as in the above example, then the synchronization and output state information propagate in opposite directions on either side of the master, both sides toward the master. In this case, the master only issues the synchronization acknowledgment and the global output state once the synchronization request from both sides has been received. For example, consider the case where chip 2III is the master of group 92. Furthermore, in embodiments the synchronization block 95 of some of the chips 2 could be connected to that of three or more other chips 2, thus creating multiple branches of chains toward the master. Each chain then behaves as described above, and the master only issues the synchronization acknowledgment and the global output state once the synchronization request has been received from all the chains. And/or, one or more of the chips 2 could be connected to an external resource such as the host processor, a network card, a storage device or an FPGA. In certain embodiments, the signaling of the synchronization and output state information is implemented in the following manner.
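The chain propagation described above can be sketched in software. The following Python model is a minimal illustrative sketch, not the patent's hardware circuitry: the assumption that output states are single bits combined by AND, and the function name, are ours. It models each slave forwarding a sync_req together with a running AND aggregate toward the master, which then broadcasts a sync_ack and the global aggregate back down the chain.

```python
# Minimal software model (assumption: binary output states aggregated by AND)
# of the daisy-chain synchronization described above. Chips are listed in
# chain order; the last entry is the master.

def chain_sync(chip_exit_states):
    """chip_exit_states: list of 0/1 local output states in chain order,
    master last. Returns the global aggregate as seen by every chip."""
    running = 1  # identity element for AND aggregation
    # Forward pass: each slave, once its own tiles are ready, forwards
    # sync_req plus the running aggregate to the next chip in the chain.
    for state in chip_exit_states[:-1]:
        running &= state
    # The master combines the incoming running aggregate with its own state...
    global_agg = running & chip_exit_states[-1]
    # ...and propagates sync_ack plus the global aggregate back down the
    # chain, so every chip observes the same value.
    return [global_agg] * len(chip_exit_states)

print(chain_sync([1, 1, 0, 1]))  # one chip reports 0, so every chip sees 0
```

Because the aggregate is folded in hop by hop, the master never needs to see the individual chip states, which matches the running-aggregate behavior described for the chain.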
The bundle of four synchronization wires 96 between each pair of chips 2 comprises two pairs of wires, a first pair 96_0 and a second pair 96_1. Each pair comprises an instance of a synchronization request wire and an instance of a synchronization acknowledgment wire. To signal a running aggregate output state of value 0, the synchronization block 95 of the sending chip 2 uses the synchronization request wire of the first pair 96_0 when signaling the synchronization request (sync_req); or to signal a running aggregate of value 1, the synchronization block 95 uses the synchronization request wire of the second pair 96_1 when signaling the synchronization request. To signal a global aggregate output state of value 0, the synchronization block 95 of the sending chip 2 uses the synchronization acknowledgment wire of the first pair 96_0 when signaling the synchronization acknowledgment (sync_ack); or to signal a global aggregate of value 1, the synchronization block 95 uses the synchronization acknowledgment wire of the second pair 96_1 when signaling the synchronization acknowledgment. Note that the above is only the mechanism for propagating the synchronization and output state information. The actual data (content) is transmitted by another channel, for example as described elsewhere herein with reference to FIG. 16. Further, note that this is only one example implementation, and those skilled in the art will be able to construct other circuits for implementing the described synchronization and aggregation functionality once given the specification of that functionality disclosed herein. For example, the synchronization logic (95 in FIG. 18) could instead use packets carried over the interconnect 34, 72 as an alternative to dedicated wiring. For example, the sync_req and/or the sync_ack could each be transmitted in the form of one or more packets.
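The four-wire signaling just described amounts to a small encode/decode table: the one-bit aggregate selects which pair's wire carries the event. The sketch below is ours, not the patent's circuit; the wire names are illustrative labels only.

```python
# Illustrative encoding of the four-wire bundle 96: the aggregate bit selects
# which pair (96_0 or 96_1) is pulsed, so the event (request/acknowledgment)
# and the one-bit aggregate value travel together on a single wire.
WIRES = {
    ("sync_req", 0): "96_0.req",
    ("sync_req", 1): "96_1.req",
    ("sync_ack", 0): "96_0.ack",
    ("sync_ack", 1): "96_1.ack",
}

def signal(event, aggregate_bit):
    """Pick the wire used to signal this event with this aggregate value."""
    return WIRES[(event, aggregate_bit)]

def decode(wire):
    """Reverse lookup: which event and value does a pulse on this wire mean?"""
    return next(k for k, v in WIRES.items() if v == wire)

assert signal("sync_req", 1) == "96_1.req"
assert decode("96_0.ack") == ("sync_ack", 0)
```

The point of the scheme is that no extra data wire is needed for the aggregate: two request wires and two acknowledgment wires suffice to carry both the handshake and a one-bit state in each direction.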
The functionality of the SYNC instruction in its different possible modes is summarized as follows.

SYNC tile (performs a local barrier synchronization on a tile):
• the supervisor's execution mode passes from executing to waiting for the worker threads to exit
• instruction issue for the supervisor thread is suspended until all worker threads are inactive
• when all worker threads are inactive, the aggregated worker output state is made available through the local consensus register ($LC) 38.

SYNC chip (performs an internal, on-chip barrier synchronization):
• the supervisor's execution mode passes from executing to waiting for the worker threads to exit
• instruction issue for the supervisor thread is suspended until all worker threads are inactive
• when all worker threads are inactive:
- the aggregated local worker output state is made available through the local consensus register ($LC) 38
- participation in the internal synchronization is signaled to the exchange fabric 34
- the supervisor remains suspended until the tile 4 receives an internal synchronization acknowledgment from the exchange fabric 34
- the system-level output state is updated in the global consensus register ($GC) 42.
SYNC zone_n (performs an external barrier synchronization across the whole of zone n):
• the supervisor's execution mode passes from executing to waiting for the worker threads to exit
• instruction issue for the supervisor thread is suspended until all worker threads are inactive
• when all worker threads are inactive:
- the aggregated local worker output state is made available through the local consensus register ($LC) 38
- participation in the external synchronization is signaled to the external system, for example the synchronization logic in the above-mentioned external interconnect 72
- the supervisor remains suspended until the tile 4 receives an external synchronization acknowledgment from the external system 72
- the system-level output state is updated in the global consensus register ($GC) 42.
As mentioned previously, not all of the tiles 4 need necessarily participate in the synchronization. In embodiments, as described, the group of participating tiles can be defined by the mode operand of the synchronization instruction. However, this only allows the selection of predefined groups of tiles. It is recognized here that it would also be desirable to be able to select participation in the synchronization on a tile-by-tile basis. Consequently, in embodiments, an alternative or additional mechanism is provided for selecting which individual tiles 4 participate in the barrier synchronization. In particular, this is achieved by providing an additional type of instruction in the processor's instruction set, to be executed by one or some of the tiles 4 in place of the SYNC instruction. This instruction may be called an abstention instruction, or SANS instruction (start automatic non-participatory synchronization). In embodiments, the SANS instruction is reserved for use by the supervisor thread.
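The pattern common to the three modes above (suspend supervisor issue, wait for all workers, aggregate into $LC, and for the chip and zone modes signal outward and wait for the acknowledgment) can be sketched in software. This is an illustrative model under our own assumptions, not the actual pipeline logic: the binary AND aggregation and the function names are ours.

```python
# Illustrative model of the supervisor-side behavior of SYNC (any mode):
# supervisor issue is suspended until every worker has executed EXIT, the
# worker exit states are aggregated into $LC, and for the chip/zone modes a
# sync_req is raised and the supervisor stays suspended until sync_ack.

def sync(mode, worker_exit_states, external_barrier):
    # Waiting for the workers is modeled as having their final exit states.
    local_consensus = all(worker_exit_states)   # $LC, binary AND assumption
    if mode == "tile":
        return {"$LC": local_consensus}
    # "chip" / "zone_n": signal participation, block until acknowledged.
    global_consensus = external_barrier(local_consensus)  # sync_req -> sync_ack
    return {"$LC": local_consensus, "$GC": global_consensus}

# A toy barrier that ANDs this tile's state with one other participant's.
result = sync("chip", [True, True, False], lambda lc: lc and True)
print(result)  # {'$LC': False, '$GC': False}
```

The only difference between the modes, in this model as in the summary above, is whether anything beyond the local consensus register is involved and which external system receives the participation signal.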
In embodiments it takes a single immediate operand:

SANS n_barriers

[0143] The behavior of the SANS instruction is to cause the tile on which it is executed to abstain from the current barrier synchronization, but without holding back the other tiles which are waiting for all the tiles in the specified synchronization group to execute SYNC. In effect, it says "go on without me". When the SANS instruction is executed, its operation code triggers the logic in the execution unit of the execution stage 18 to send an instance of the synchronization request signal (sync_req) to the internal and/or external synchronization controller 36, 76 (depending on the mode). In embodiments, the synchronization request generated by the SANS instruction applies to any synchronization group 91, 92 which includes the tile 4 that executed the SANS instruction. That is, whatever synchronization group the tiles 4 in the local chip or chips use next (they must agree on the synchronization group), the sync_req from those that executed the SANS instruction will always be valid. Thus, from the point of view of the synchronization controller logic 36, 76 and of the other tiles 4 in the synchronization group, the tile 4 executing the SANS instruction appears exactly as a tile 4 executing a SYNC instruction, and does not hold back the synchronization barrier or the sending of the synchronization acknowledgment signal (sync_ack) from the synchronization logic 36, 76. That is to say, the tiles 4 executing the SANS instruction in place of the SYNC instruction do not hold back or stall any of the other tiles 4 involved in any synchronization group of which the tile in question is otherwise a member. Any handshake performed by a SANS instruction is valid for all synchronization groups 91, 92.
[0145] However, unlike the SYNC instruction, the SANS instruction does not cause supervisor instruction issue to be paused while waiting for the synchronization acknowledgment signal (sync_ack) from the synchronization logic 36, 76. Instead, the respective tile can simply continue, uninhibited by the current barrier synchronization being carried out between the other tiles 4 which have executed SYNC instructions. Thus, by mimicking a synchronization but without waiting, the SANS instruction allows its tile 4 to continue processing one or more tasks while still allowing the other tiles 4 to synchronize. The operand n_barriers specifies the number of "posted" synchronizations, that is to say the number of future synchronization points (barriers) in which the tile will not participate. Alternatively, it is not excluded that in other embodiments the SANS instruction does not take this operand, and that instead each execution of the SANS instruction causes only a one-time abstention. By means of the SANS instruction, certain tiles 4 may be made responsible for carrying out tasks outside the direct scope of the BSP operating schedule. For example, it may be desirable to allocate to a small number of tiles 4 in a chip 2 the initiation (and processing) of data transfers to and/or from the host memory while the majority of tiles 4 are occupied with the primary computation task or tasks. In such scenarios, those tiles 4 which are not directly involved in the primary computation can declare themselves effectively disconnected from the synchronization mechanism for a period of time using the automatic non-participatory synchronization (SANS) feature. When using this feature, a tile 4 is not required to actively signal its readiness for synchronization (that is, by executing the SYNC instruction) for any of the synchronization zones, and in embodiments makes a null contribution to the aggregated output state.
The SANS instruction begins or extends a period during which the tile 4 on which it is executed will abstain from active participation in inter-tile synchronization (or synchronization with other external resources, if they are also involved in the synchronization). During this period, this tile 4 will automatically signal its readiness for synchronization in all zones, and in embodiments will also make a null contribution to the global aggregated consensus $GC. This period of time can be expressed in the form of an unsigned immediate operand (n_barriers) indicating how many additional future synchronization points will be signaled automatically by this tile 4. Upon execution of the SANS instruction, the value n_barriers specified by its operand is placed in a countdown register $ANS_DCOUNT on the respective tile 4. This is a piece of architectural state used to keep track of the number of additional future sync_reqs that need to be made. If the automatic non-participatory synchronization mechanism is currently inactive, the first assertion of readiness (synchronization request, sync_req) is made immediately; subsequent assertions occur in the background, each once the previous synchronization has completed (that is, following assertion of the synchronization acknowledgment, sync_ack). If the automatic non-participatory synchronization mechanism is currently active, the countdown register $ANS_DCOUNT is simply updated automatically, so that no synchronization acknowledgment signal is left unaccounted for. The automatic non-participatory synchronization mechanism is implemented in dedicated hardware logic, preferably with an instance of it in each tile 4, although in other embodiments it is not excluded that it could instead be implemented centrally for a group of tiles or for all tiles. Regarding the behavior of the output state, there are in fact a number of possibilities depending on the implementation.
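The $ANS_DCOUNT bookkeeping described above can be modeled as a small state machine. The sketch below is ours, not the hardware design: it assumes that executing SANS loads or tops up the countdown, that each background sync_ack decrements it, and that the mechanism deactivates once the count reaches zero.

```python
class AutoSync:
    """Illustrative model of the automatic non-participatory sync mechanism
    and its countdown register $ANS_DCOUNT (our naming follows the text)."""

    def __init__(self):
        self.dcount = 0          # models the $ANS_DCOUNT register

    def execute_sans(self, n_barriers):
        # SANS begins or extends the abstention period: if the mechanism is
        # already active, the countdown is simply topped up.
        self.dcount += n_barriers

    def active(self):
        return self.dcount > 0

    def on_sync_ack(self):
        # Each completed barrier consumes one automatically signaled sync_req,
        # so no acknowledgment is left unaccounted for.
        if self.dcount > 0:
            self.dcount -= 1

a = AutoSync()
a.execute_sans(3)            # abstain from the next three barriers
a.on_sync_ack()
a.on_sync_ack()
assert a.active() and a.dcount == 1
a.on_sync_ack()
assert not a.active()        # tile resumes normal SYNC participation
```

Executing SANS again while the mechanism is active simply extends the period, which matches the "begins or extends" wording above.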
In embodiments, to obtain the globally aggregated output state, the synchronization logic 36, 76 aggregates only the local output states originating from those tiles 4 in the specified synchronization group which executed a SYNC instruction, and not from the one or ones which executed a SANS instruction (the abstaining tile or tiles). Alternatively, the globally aggregated output state is obtained by aggregating the local output states from all the tiles 4 in the synchronization group, both those which executed a SYNC instruction and those which executed a SANS instruction (both the participating and the abstaining tiles 4). In the latter case, the local output state output for global aggregation by the abstaining tile or tiles 4 may be the actual locally aggregated output state of that tile's worker threads at the time of execution of the SANS instruction, just as with the SYNC instruction (see the description of the local consensus register $LC 38). Alternatively, the local output state produced by the abstaining tile 4 may be a default value, for example the true value (for example logic state 1) in embodiments where the output state is binary. This prevents the abstaining tile 4 from interfering with the global output state in embodiments where any false local output state causes the global output state to be false. Regarding the return of the global output state, there are two possibilities, regardless of whether or not the abstaining tile submits a local output state for producing the global aggregate, and regardless of whether that value is an actual value or a default value.
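The implementation options just enumerated (exclude the abstainers, include their actual local aggregate, or substitute a default) can be compared in a short sketch. The function below is illustrative only; the binary AND aggregation and the policy names are our assumptions.

```python
# Illustrative comparison of the three aggregation policies for abstaining
# tiles described above, assuming binary output states combined by AND.

def global_aggregate(tiles, policy):
    """tiles: list of (local_state, abstained) pairs; local_state is 0/1."""
    states = []
    for local_state, abstained in tiles:
        if not abstained:
            states.append(local_state)
        elif policy == "include_actual":
            states.append(local_state)   # use $LC at the time of SANS
        elif policy == "include_default":
            states.append(1)             # default true: abstainer cannot veto
        # policy == "exclude": the abstainer contributes nothing at all
    return int(all(states))

tiles = [(1, False), (0, True), (1, False)]  # the abstainer's local state is 0
assert global_aggregate(tiles, "exclude") == 1
assert global_aggregate(tiles, "include_default") == 1
assert global_aggregate(tiles, "include_actual") == 0
```

The example makes the practical difference visible: only the "include_actual" policy lets a false state on an abstaining tile pull the global aggregate false, which is exactly the interference that the default-value option is described as preventing.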
That is, in one implementation, the global aggregate output state produced by the synchronization logic 36, 76 in the interconnect 34, 72 is stored only in the global consensus registers $GC 42 of the participating tiles 4 which executed a SYNC instruction, and not of the abstaining tiles 4 which instead executed a SANS instruction. In embodiments, a default value is instead stored in the global consensus register $GC 42 of the tile or tiles 4 which executed a SANS instruction (the abstaining tiles). For example, this default value may be true, for example logic state 1, in the case of a binary global output state. However, in an alternative implementation, the actual global aggregate produced by the synchronization logic 36, 76 is stored in the global consensus registers $GC 42 of both the participating tiles 4 which executed SYNC instructions and the abstaining tiles 4 which instead executed a SANS instruction. Thus all the tiles of the group can still have access to the globally aggregated output state. FIG. 15 illustrates an example application of the processor architecture described herein, namely an artificial intelligence application. As is well known to those skilled in the art of artificial intelligence, artificial intelligence begins with a learning stage in which the artificial intelligence algorithm learns a knowledge model. The model comprises a graph of interconnected nodes (that is to say, vertices) 102 and edges (that is to say, links) 104. Each node 102 in the graph has one or more input edges and one or more output edges. Some of the input edges of some of the nodes 102 are the output edges of some others of the nodes, thereby connecting the nodes together to form the graph.
In addition, one or more of the input edges of one or more of the nodes 102 form the inputs to the graph as a whole, and one or more of the output edges of one or more of the nodes 102 form the outputs of the graph as a whole. Sometimes a given node may even have all of these: inputs to the graph, outputs from the graph, and connections to other nodes. Each edge 104 communicates a value, or more often a tensor (n-dimensional matrix), these forming the inputs and outputs provided to and obtained from the nodes 102 on their input and output edges respectively. Each node 102 represents a function of its one or more inputs received on its input edge or edges, the result of this function being the output or outputs provided on its output edge or edges. Each function is parameterized by one or more respective parameters (sometimes called weights, though they need not necessarily be multiplicative weights). In general, the functions represented by the different nodes 102 may take different forms of function and/or may be parameterized by different parameters. Further, each of said one or more parameters of each node's function is characterized by a respective error value. Moreover, a respective condition may be associated with the error or errors in the parameter or parameters of each node 102. For a node 102 representing a function parameterized by a single parameter, the condition may be a simple threshold, that is, the condition is satisfied if the error is within the specified threshold but not satisfied if the error is beyond the threshold. For a node 102 parameterized by more than one respective parameter, the condition for that node 102 having reached an acceptable level of error may be more complex. For example, the condition may be satisfied only if each of the parameters of that node 102 remains below the respective threshold.
In another example, a combined metric may be defined combining the errors in the different parameters of the same node 102, and the condition may be satisfied if the value of the combined metric remains below a specified threshold, but not satisfied if the value of the combined metric is beyond the threshold (or vice versa depending on the definition of the metric). Whatever the condition, this gives a measure of whether the error in the node's parameter or parameters remains below a certain level or degree of acceptability. In general, any suitable metric may be used. The condition or the metric may be the same for all the nodes, or may differ for certain respective nodes. In the learning stage, the algorithm receives experience data, that is, multiple data points representing different possible combinations of inputs to the graph. As more and more experience data is received, the algorithm gradually adjusts the parameters of the various nodes 102 of the graph, based on the experience data, so as to try to minimize the errors in the parameters. The goal is to find values of the parameters such that the output of the graph is as close as possible to a desired output for a given input. As the graph as a whole tends toward such a state, the graph is said to converge. After a suitable degree of convergence, the graph can be used to perform predictions or inferences, that is, to predict an outcome for some given input or to infer a cause for some given output. The learning stage can take a number of different possible forms. For example, in a supervised approach, the input experience data takes the form of training data, that is, inputs which correspond to known outputs. With each data point, the algorithm can adjust the parameters so that the output more closely matches the known output for the given input.
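A per-node convergence condition of the kinds described (single-parameter threshold, per-parameter thresholds, or a combined metric) can be sketched as follows. The root-mean-square combined metric is our own illustrative choice, not one mandated by the description, which leaves the metric open.

```python
# Illustrative per-node convergence checks for the conditions described above.

def single_param_ok(error, threshold):
    """Single-parameter node: satisfied if the error is within the threshold."""
    return abs(error) <= threshold

def all_params_ok(errors, thresholds):
    """Multi-parameter node: satisfied only if every parameter error is
    within its respective threshold."""
    return all(abs(e) <= t for e, t in zip(errors, thresholds))

def combined_metric_ok(errors, threshold):
    """Multi-parameter node, combined metric variant. One possible metric:
    root-mean-square of the parameter errors (our assumption)."""
    rms = (sum(e * e for e in errors) / len(errors)) ** 0.5
    return rms <= threshold

assert single_param_ok(0.02, 0.05)
assert not all_params_ok([0.01, 0.9], [0.05, 0.05])
assert combined_metric_ok([0.03, 0.04], 0.05)
```

Any of these boolean results could serve as the worker thread's individual output state for the node it represents, which is how the exit state mechanism is applied to convergence later in the description.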
In the subsequent prediction stage, the graph can then be used to map an input query to an approximate predicted output (or vice versa, if an inference is performed). Other approaches are also possible. For example, in an unsupervised approach, there is no concept of a reference result per input datum; instead the artificial intelligence algorithm is left to identify its own structure in the output data. Or, in a reinforcement approach, the algorithm tries out at least one possible output for each data point in the input experience data, and is told whether that output is positive or negative (and potentially the degree to which it is positive or negative), for example win or lose, reward or punishment, or the like. Over many trials, the algorithm can gradually adjust the parameters of the graph so as to be able to predict inputs which will result in a positive outcome. The various approaches and algorithms for learning a graph are known to those skilled in the art of artificial intelligence. According to an example application of the techniques disclosed herein, each worker thread is programmed to perform the computations associated with a respective individual node among the nodes 102 of an artificial intelligence graph. In this case, at least some of the edges 104 between the nodes 102 correspond to exchanges of data between threads, and some may involve exchanges between tiles. Furthermore, the individual output states of the worker threads are used by the programmer to represent whether or not the respective node 102 has satisfied its respective condition for convergence of the parameter or parameters of that node, that is, whether the error in the parameter or parameters remains within the acceptable level or region in error space.
For example, one example use of the embodiments is one in which each of the individual output states is an individual bit and the aggregate output state is an AND of the individual output states (or equivalently an OR if 0 is taken as positive); or one in which the aggregate output state is a trinary value representing whether the individual output states were all true, all false, or mixed. Thus, by examining a single register value in the output state register 38, the program can determine whether the graph as a whole, or at least a sub-region of the graph, has converged to an acceptable degree. As another variant of this, it is possible to use embodiments in which the aggregation takes the form of a statistical aggregation of individual confidence values. In that case, each individual output state represents a confidence (for example a percentage) that the parameters of the node represented by the respective thread have reached an acceptable degree of error. The aggregate output state can then be used to determine an overall confidence level indicating whether the graph, or a sub-region of the graph, has converged to an acceptable degree. In the case of a multi-tile arrangement 6, each tile executes a subgraph of the graph. Each subgraph comprises a supervisor subprogram comprising one or more supervisor threads, and a set of worker threads in which some or all of the workers may take the form of codelets.
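The three aggregation flavors mentioned at the start of this passage (bitwise AND, trinary all-true/all-false/mixed, and statistical confidence) can be sketched side by side. This is an illustrative sketch under our own assumptions; the mean as the statistical aggregation is our choice, since the description leaves the statistic open.

```python
# Illustrative versions of the three output state aggregations described above.

def and_aggregate(bits):
    """Binary scheme: global state is the AND of the individual bits."""
    return int(all(bits))

def trinary_aggregate(bits):
    """Trinary scheme: were the individual states all true, all false, or mixed?"""
    if all(bits):
        return "all_true"
    if not any(bits):
        return "all_false"
    return "mixed"

def confidence_aggregate(confidences):
    """Statistical scheme: one simple choice (ours) is the mean per-node
    confidence that an acceptable degree of error has been reached."""
    return sum(confidences) / len(confidences)

assert and_aggregate([1, 1, 0]) == 0
assert trinary_aggregate([1, 0, 1]) == "mixed"
assert abs(confidence_aggregate([0.9, 0.8, 1.0]) - 0.9) < 1e-9
```

In each case the program inspects one aggregate value rather than polling every node, which is the point made above about reading a single register to judge convergence of the whole graph or sub-region.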
In such applications, or indeed in any graph-based application where each worker thread is used to represent a respective node of a graph, the codelet comprised by each worker may be defined as a software procedure operating on the persistent state and the inputs and/or outputs of one vertex, wherein the codelet:
• is launched on one worker thread register context, to run in one barrel slot, by the supervisor thread executing a run instruction;
• runs to completion without communication with other codelets or the supervisor (except for the return to the supervisor when the codelet exits);
• has access to the persistent state of one vertex via a memory pointer provided by the run instruction, and to a non-persistent working area in memory which is private to that barrel slot; and
• executes EXIT as its last instruction, whereupon the barrel slot which it was using is returned to the supervisor, and the exit state specified by the exit instruction is aggregated with the local exit state of the tile which is visible to the supervisor.
Updating a graph (or sub-graph) means updating each constituent vertex once, in any order consistent with the causality defined by the edges. Updating a vertex means running a codelet on the vertex state. A codelet is an update procedure for vertices; one codelet is usually associated with many vertices. The supervisor executes one RUN instruction per vertex, each such instruction specifying a vertex state address and a codelet address. It will be appreciated that the above embodiments have been described by way of example only. For instance, the applicability of the exit state aggregation mechanism is not limited to the architecture described above, in which a separate context is provided for the supervisor thread, or in which the supervisor thread runs in a slot and then relinquishes its slot to a worker thread.
In another arrangement, for example, the supervisor may run in its own dedicated slot. Furthermore, the terms supervisor and worker thread do not imply specific responsibilities unless explicitly stated, and in particular are not necessarily limited in themselves to the scheme described above in which a supervisor thread relinquishes its time slot to a worker thread, and so forth. In general, a worker thread may refer to any thread to which some computational task is allocated. The supervisor may represent any kind of overseeing or coordinating thread responsible for actions such as: assigning worker threads to barrel slots, and/or performing barrier synchronizations between multiple threads, and/or performing any control-flow operation (such as a branch) depending on the outcome of more than one thread. Where reference is made to a sequence of interleaved time slots, or the like, this does not necessarily imply that the sequence referred to makes up all possible or available slots. For instance, the sequence in question could be all possible slots or only those currently active. It is not necessarily excluded that there may be other potential slots which are not currently included in the scheduled sequence. The term tile as used herein is not necessarily limited to any particular topography or the like, and in general may refer to any modular unit of processing resource, comprising a processing unit 10 and corresponding memory 11, in an array of like modules, typically at least some of which are on the same chip (that is, the same die). Furthermore, the scope of the present disclosure is not limited to a time-deterministic internal interconnect or a non-time-deterministic external interconnect. The synchronization and aggregation mechanisms disclosed herein can also be used in a fully time-deterministic arrangement, or in a fully non-time-deterministic arrangement.
Furthermore, where reference is made to performing a synchronization or an aggregation across a group of tiles, or across a plurality of tiles or the like, this does not necessarily have to refer to all the tiles on the chip or all the tiles in the system unless explicitly stated. For example, the SYNC and EXIT instructions could be configured to perform the synchronization and aggregation only in relation to a certain subset of tiles 4 on a given chip and/or only a subset of chips 2 in a given system; while some other tiles 4 on a given chip, and/or some other chips in a given system, may not be involved in a given BSP group, and could even be used for some completely separate set of tasks unrelated to the computation being performed by the group at hand. Also, while certain modes of the SYNC instruction have been described here, the scope of the present disclosure more generally is not limited to such modes. For instance, the list of modes given above is not necessarily exhaustive. Or in other embodiments, the SYNC instruction may have fewer modes; for example, the SYNC need not support different hierarchical levels of external synchronization, or need not distinguish between on-chip and inter-chip synchronizations (that is, in an inter-tile mode, always acting in relation to all tiles regardless of whether on-chip or off-chip). In yet further alternative embodiments, the SYNC instruction need not take a mode as an operand at all. For example, in embodiments, separate versions of the SYNC instruction (different operation codes) may be provided for the different levels of synchronization and output state aggregation (such as different SYNC instructions for on-tile synchronization and for inter-tile, on-chip synchronization).
Or in other embodiments, a dedicated SYNC instruction may be provided only for inter-tile synchronizations (leaving on-tile synchronization between threads, where required, to be performed in general-purpose software). Furthermore, the synchronization zones are not limited to being hierarchical (that is, one nested within another), and in other embodiments the selectable synchronization zones may consist of or include one or more non-hierarchical groups (all the tiles of that group not nested within a single other selectable group). Further, the synchronization schemes described here do not exclude the involvement, in embodiments, of external resources other than multi-tile processors, for example a CPU processor such as the host processor, or even one or more components which are not processors, such as one or more network cards, storage devices and/or FPGAs. For example, some tiles may elect to engage in data transfers with an external system, these transfers forming the computational load of that tile. In this case, the transfers should be completed before the next barrier. In some cases, the exit state of the tile may depend on the result of the communication with the external resource, and that resource may thus indirectly influence the exit state. Alternatively or additionally, resources other than the multi-tile processors, for example the host or one or more FPGAs, could be incorporated into the synchronization network itself. That is, a synchronization signal such as a sync_req is required from this or these additional resources in order for the synchronization barrier to be satisfied and for the tiles to proceed to the next exchange phase. Furthermore, in embodiments the aggregated global exit state may include in the aggregation an exit state of the external resource, for example from an FPGA. Other applications and variants of the disclosed techniques may become apparent to those skilled in the art once given the disclosure herein.
The scope of the present description is not limited by the embodiments described, but only by the appended claims.
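The claims below refer to two forms of aggregation of exit states: a single-bit Boolean AND/OR, and a two-bit "trinary" value indicating whether the local states are all true, all false, or mixed. A short sketch of the trinary form (the bit encodings chosen here are hypothetical; the claims specify only that at least two bits represent the three cases):

```python
from enum import Enum

class Trinary(Enum):
    """Two-bit aggregate exit state: all local states true, all false,
    or mixed. The numeric encodings are illustrative assumptions."""
    ALL_TRUE = 0b00
    ALL_FALSE = 0b01
    MIXED = 0b10

def aggregate_trinary(local_states):
    # Aggregation of single-bit local exit states into the trinary value.
    if all(local_states):
        return Trinary.ALL_TRUE
    if not any(local_states):
        return Trinary.ALL_FALSE
    return Trinary.MIXED

print(aggregate_trinary([True, True, True]))   # Trinary.ALL_TRUE
print(aggregate_trinary([False, False]))       # Trinary.ALL_FALSE
print(aggregate_trinary([True, False]))        # Trinary.MIXED
```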
Claims:
Claims (23)

1. A processing system comprising an arrangement of tiles and an interconnect for communicating between the tiles, wherein: each tile comprises an execution unit for executing machine code instructions, each being an instance of a predefined set of instruction types in an instruction set of the processor, each instruction type in the instruction set being defined by a corresponding operation code and zero or more operand fields for taking zero or more operands; the interconnect is operable to conduct communications among a group of some or all of the tiles according to a bulk synchronous parallel scheme, whereby each of the tiles of said group performs an on-tile calculation phase followed by an inter-tile exchange phase, the exchange phase being held back until all the tiles in the group have completed the calculation phase, each tile in the group having a local exit state upon completion of the calculation phase; the instruction set includes a synchronization instruction intended to be executed by each tile of the group upon completion of its calculation phase, execution of the synchronization instruction causing the execution unit to send a synchronization request to hardware logic located in the interconnect; and the logic located in the interconnect is arranged to aggregate the local exit states into a global exit state and, in response to the completion of the calculation phase by all the tiles of the group as indicated by the reception of the synchronization request from all the tiles in the group, to store the global exit state in a global exit state register on each of the group's tiles, thereby making the global exit state accessible by a portion of code executing on each of the group's tiles.

2.
The processing system according to claim 1, wherein the execution unit on each tile is arranged to pause instruction issue in response to execution of the synchronization instruction; and wherein the logic in the interconnect is arranged, in response to receiving the synchronization request from all of the tiles in the group, to return a synchronization acknowledgment signal to each of the tiles in the group in order to resume instruction issue.

3. The processing system according to claim 1, wherein each of the local exit states and the global exit state is a single bit.

4. The processing system according to claim 3, wherein the aggregation consists of a Boolean AND of the local exit states, or of a Boolean OR of the local exit states.

5. The processing system according to claim 1 or 2, wherein the aggregated exit state comprises at least two bits representing a trinary value, indicating whether the local exit states are all true, all false, or mixed.

6. The processing system according to claim 1, wherein each of the tiles of the group comprises a local exit state register arranged to represent the local exit state of the tile.

7.
The processing system according to any one of the preceding claims, wherein each tile in the group comprises: multiple sets of context registers, each set of context registers being arranged to store a program state of a respective one of multiple execution threads; and a scheduler arranged to schedule the execution of a respective one of a plurality of worker threads in each of a plurality of time slots in a repeating sequence of interleaved time slots, the program state of each of the worker threads being stored in a respective one of the sets of context registers; wherein, according to the bulk synchronous parallel scheme, the exchange phase is held back until all the worker threads on all the tiles of the group have completed the calculation phase; wherein the local exit state on each tile is an aggregate of an individual exit state provided by each of the worker threads on the tile; and wherein said portion of code comprises at least one of the multiple worker threads on the tile.

8. The processing system according to claim 7, wherein each tile of the group comprises hardware logic arranged to carry out the aggregation of the individual exit states into the local exit state.

9. The processing system according to claim 8, wherein the instruction set comprises an exit instruction intended to be included in each of the worker threads, the execution unit being arranged to provide the individual exit state of the respective worker thread and to terminate the respective worker thread in response to the operation code of the exit instruction.

10. The processing system according to claim 7, 8 or 9, wherein each of the individual exit states and local exit states is a single bit, and the aggregation of the individual exit states consists of a Boolean AND of the individual exit states, or a Boolean OR of the individual exit states.

11.
The processing system according to claim 7, 8 or 9, wherein the local exit state comprises at least two bits representing a trinary value, indicating whether the individual exit states are all true, all false, or mixed.

12. The processing system according to any one of claims 7 to 11, wherein the exchange phase is arranged to be carried out by a supervisor thread separate from the worker threads, and said at least one thread comprises the supervisor thread.

13. The processing system according to claim 12 when dependent on claim 2, wherein the pausing of instruction issue comprises at least pausing instruction issue from the supervisor thread pending the synchronization acknowledgment.

14. The processing system according to claim 12 or 13, wherein the sets of context registers on each tile comprise: multiple sets of worker-thread context registers arranged to represent the respective program states of a plurality of worker threads, and an additional set of supervisor context registers arranged to represent a program state of the supervisor thread.

15. The processing system according to claim 14, wherein: the supervisor thread is arranged to begin by executing in each of the time slots; the instruction set further comprises an abandon instruction, and the execution unit is arranged, in response to the operation code of the abandon instruction, to relinquish the time slot in which the abandon instruction is executed to the respective worker thread; and the exit instruction causes the respective time slot in which the exit instruction is executed to be handed back to the supervisor thread, so that the supervisor thread resumes execution in the respective slot.

16.
The processing system according to any one of the preceding claims, programmed with said code; wherein said portion of code is arranged to use the global exit state, once valid, to make a branch decision which depends on the global exit state.

17. The processing system according to any one of the preceding claims, programmed to carry out an artificial intelligence algorithm in which each node in a graph has one or more input edges and one or more output edges, the input edges of at least some of the nodes being the output edges of some others of the nodes, each node comprising a respective function relating its output edges to its input edges, each respective function being parameterized by one or more respective parameters, and each of the respective parameters having an associated error, such that the graph converges toward a solution as the errors in some or all of the parameters reduce; wherein each of the tiles models a respective subgraph comprising a subset of the nodes of the graph, and each of the local exit states is used to indicate whether the errors in said one or more parameters of the nodes of the respective subgraph have satisfied a predetermined condition.

18. The processing system according to any one of the preceding claims, wherein the group is selected at least in part by an operand of the synchronization instruction.

19. The processing system according to claim 18, wherein the operand of the synchronization instruction selects whether to include in said group only tiles located on the same chip, or also tiles located on different chips.

20. The processing system according to claim 18 or 19, wherein the operand of the synchronization instruction selects said group from among different hierarchical levels of groupings.

21.
The processing system according to any one of the preceding claims, wherein the instruction set further comprises an abstention instruction, which causes the tile on which the abstention instruction is executed to exclude itself from said group.

22. A method of operating a processing system comprising an arrangement of tiles and an interconnect for communicating between the tiles, wherein each tile comprises an execution unit for executing machine code instructions, each being an instance of a predefined set of instruction types in an instruction set of the processor, each instruction type in the instruction set being defined by a corresponding operation code and zero or more operand fields for taking zero or more operands; the method comprising: conducting communications among a group of some or all of the tiles, via the interconnect, according to a bulk synchronous parallel scheme, whereby each of the tiles of the group performs an on-tile calculation phase followed by an inter-tile exchange phase, the exchange phase being held back until all the tiles in the group have completed the calculation phase, each tile in the group having a local exit state upon completion of the calculation phase; wherein the instruction set comprises a synchronization instruction intended to be executed by each tile of the group upon completion of its calculation phase, execution of the synchronization instruction causing the execution unit to send a synchronization request to hardware logic in the interconnect; and the method comprises, in response to the completion of the calculation phase by all the tiles of the group as indicated by the reception of the synchronization request from all the tiles of the group, triggering the logic located in the interconnect to aggregate the local exit states into a global exit state, and to store the global exit state in a global exit state register on each of the
tiles in the group, thereby making the global exit state accessible to a portion of code executing on each of the group's tiles.

23. A computer program product comprising code embodied on computer-readable storage and arranged to execute on the processing system of any one of claims 1 to 21, the code comprising a portion intended to be executed on each tile of the group, including an instance of the synchronization instruction in each portion.
Similar technologies:
Publication number | Publication date | Patent title
FR3072800A1 | 2019-04-26 | SYNCHRONIZATION IN A MULTI-PAVEMENT PROCESSING ARRANGEMENT
FR3072799A1 | 2019-04-26 | COMBINING STATES OF MULTIPLE EXECUTIVE WIRES IN A MULTIPLE WIRE PROCESSOR
FR3072797A1 | 2019-04-26 | SYNCHRONIZATION IN A MULTI-PAVING AND MULTI-CHIP TREATMENT ARRANGEMENT
FR3072798A1 | 2019-04-26 | ORDERING OF TASKS IN A MULTI-CORRECTION PROCESSOR
FR3072801A1 | 2019-04-26 | SYNCHRONIZATION IN A MULTI-PAVEMENT PROCESS MATRIX
KR102190879B1 | 2020-12-14 | Synchronization amongst processor tiles
KR102183118B1 | 2020-11-25 | Synchronization in a multi-tile processing arrangement
US20170351530A1 | 2017-12-07 | Approximate synchronization for parallel deep learning
FR3090924A1 | 2020-06-26 | EXCHANGE OF DATA IN A COMPUTER
Singla 2019 | Scalable Distributed Safety Verification using Actor Architecture
EP1493083A2 | 2005-01-05 | Reconfigurable control system based on hardware implementation of petri graphs
Patent family:
Publication number | Publication date
TW201923556A | 2019-06-16
JP6797881B2 | 2020-12-09
WO2019076714A1 | 2019-04-25
DE102018126004A1 | 2019-04-25
CN110214317A | 2019-09-06
US20190121641A1 | 2019-04-25
JP2019079528A | 2019-05-23
TWI700634B | 2020-08-01
CA3021416A1 | 2019-04-20
GB2569269B | 2020-07-15
KR102262483B1 | 2021-06-08
KR20190044570A | 2019-04-30
US10564970B2 | 2020-02-18
CA3021416C | 2021-03-30
US20200089499A1 | 2020-03-19
GB201717291D0 | 2017-12-06
GB2569269A | 2019-06-19
Cited references:
Publication number | Filing date | Publication date | Applicant | Patent title
US8046727B2 | 2007-09-12 | 2011-10-25 | Neal Solomon | IP cores in reconfigurable three dimensional integrated circuits
US8866827B2 | 2008-06-26 | 2014-10-21 | Microsoft Corporation | Bulk-synchronous graphics processing unit programming
US8539204B2 | 2009-09-25 | 2013-09-17 | Nvidia Corporation | Cooperative thread array reduction and scan operations
US8713290B2 | 2010-09-20 | 2014-04-29 | International Business Machines Corporation | Scaleable status tracking of multiple assist hardware threads
US20120179896A1 | 2011-01-10 | 2012-07-12 | International Business Machines Corporation | Method and apparatus for a hierarchical synchronization barrier in a multi-node system
KR101863181B1 | 2012-01-20 | 2018-05-31 | 지이 비디오 컴프레션, 엘엘씨 | Coding concept allowing parallel processing, transport demultiplexer and video bitstream
US10067768B2 | 2014-07-18 | 2018-09-04 | Nvidia Corporation | Execution of divergent threads using a convergence barrier
US9747108B2 | 2015-03-27 | 2017-08-29 | Intel Corporation | User-level fork and join processors, methods, systems, and instructions
US10310861B2 | 2017-04-01 | 2019-06-04 | Intel Corporation | Mechanism for scheduling threads on a multiprocessor
US10672175B2 | 2017-04-17 | 2020-06-02 | Intel Corporation | Order independent asynchronous compute and streaming for graphics
GB2569098B | 2017-10-20 | 2020-01-08 | Graphcore Ltd | Combining states of multiple threads in a multi-threaded processor
DE102018205390A1 | 2018-04-10 | 2019-10-10 | Robert Bosch Gmbh | Method and device for error handling in a communication between distributed software components
DE102018205392A1 | 2018-04-10 | 2019-10-10 | Robert Bosch Gmbh | Method and device for error handling in a communication between distributed software components
GB2575294B | 2018-07-04 | 2020-07-08 | Graphcore Ltd | Host Proxy On Gateway
GB2580165B | 2018-12-21 | 2021-02-24 | Graphcore Ltd | Data exchange in a computer with predetermined delay
GB2591106B | 2020-01-15 | 2022-02-23 | Graphcore Ltd | Control of data transfer between processors
Legal status:
2019-10-15 | PLFP | Fee payment | Year of fee payment: 2
2020-10-29 | PLFP | Fee payment | Year of fee payment: 3
2021-10-27 | PLFP | Fee payment | Year of fee payment: 4
2021-10-29 | PLSC | Publication of the preliminary search report | Effective date: 20211029
Priority:
Application number | Filing date | Patent title
GB1717291.7A | GB2569269B | 2017-10-20 | 2017-10-20 | Synchronization in a multi-tile processing arrangement
GB1717291.7 | 2017-10-20